A MACHINE LEARNING APPROACH FOR GENE EXPRESSION ANALYSIS
AND APPLICATIONS
by
THANH NGOC LE
B.S., University of Economics in Ho Chi Minh City, Vietnam, 1994
M.S., National University in Ho Chi Minh City, Vietnam, 2000
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Computer Science and Information Systems
2013
This thesis for the Doctor of Philosophy degree by
Thanh Ngoc Le
has been approved for the
Computer Science and Information Systems Program
by
Gita Alaghband, Chair
Tom Altman, Advisor
Katheleen Gardiner, Co-Advisor
James Gerlach
Sonia Leach
Boris Stilman
Date: April 10, 2013
Le, Thanh Ngoc (Ph.D., Computer Science and Information Systems)
A Machine Learning approach for Gene Expression analysis and applications
Thesis directed by Professor Tom Altman
ABSTRACT
High-throughput microarray technology is an important and revolutionary technique
used in genomics and systems biology to analyze the expression of thousands of genes
simultaneously. The popular use of this technique has resulted in enormous repositories
of microarray data, for example, the Gene Expression Omnibus (GEO), maintained by
the National Center for Biotechnology Information (NCBI). However, an effective
approach to optimally exploit these datasets in support of specific biological studies is
still lacking. Specifically, an improved method is required to integrate data from
multiple sources and to select only those datasets that meet an investigator's interest. In
addition, to exploit the full power of microarray data, an effective method is required to
determine the relationships among genes in the selected datasets and to interpret the
biological meanings behind these relationships.
To address these requirements, we have developed a machine learning based approach
that includes:
• An effective meta-analysis method to integrate microarray data from multiple
sources; the method exploits information regarding the biological context of
interest provided by the biologists.
• A novel and effective cluster analysis method to identify hidden patterns in
selected data representing relationships between genes under the biological
conditions of interest.
• A novel motif finding method that discovers not only the common transcription
factor binding sites of co-regulated genes, but also the miRNA binding sites
associated with the biological conditions.
• A machine learning-based framework for microarray data analysis with a web
application to run common analysis tasks online.
The form and content of this abstract are approved. I recommend its publication.
Approved: Tom Altman
DEDICATION
I dedicate this work to my family, my wife, Lan, and my beloved daughter, Anh for
their constant support and unconditional love.
I love you all dearly.
ACKNOWLEDGMENTS
I would like to thank my advisor, Dr. Tom Altman, for his guidance and support
throughout my graduate studies. His guidance helped me to develop the self-confidence
and skills necessary to address not only research problems, but also real-world
problems. Thank you very much, Dr. Altman.
I would like to express my sincere gratitude to my co-advisor, Dr. Katheleen Gardiner,
for her continuous support on my research work throughout the years. She guided me
with remarkable patience and understanding. She was always there to help me
whenever I encountered an obstacle or needed advice. Her input was very useful. It
was really a pleasure working with her.
I would like to thank Dr. Sonia Leach for her help in solving difficult problems and for
her valuable inputs as a member of my dissertation committee. She was very generous
in sharing her vast knowledge with us. Also, I would like to thank Dr. Gita
Alaghband, Dr. James Gerlach, and Dr. Boris Stilman for serving on my dissertation
committee and for their helpful suggestions and comments for the improvement of this
project; your discussion, ideas, and feedback have been absolutely invaluable. I would
like to thank you all for giving me an example of excellence as a researcher, mentor,
instructor, and role model.
I would like to thank my fellow graduate students, my lab mates and everyone who
helped me in many ways to make my graduate student life a memorable one. I am very
grateful to all of you.
I would especially like to thank my amazing family for the love, support, and constant
encouragement I have gotten over the years.
In particular, I would like to thank my mother, my brothers and sisters. You are the salt
of the earth, and I undoubtedly could not have done this without you.
TABLE OF CONTENTS
List of Tables xii
List of Figures xv
Chapter
1. Introduction 1
1.1 Gene expression data analysis challenges 1
1.2 Motivations and proposed methods 3
1.2.1 Microarray meta-analysis 3
1.2.2 Cluster analysis of gene expression data 6
1.2.3 DNA sequence analysis 8
1.3 Research contributions 9
2. Microarray analysis background 13
2.1 Microarray 13
2.1.1 Microarray platforms 13
2.1.2 Gene expression microarray data analysis 16
2.2 Microarray data meta-analysis 18
2.2.1 Microarray data integration 18
2.2.2 Issues with microarray data meta-analysis 23
2.3 Cluster analysis 32
2.3.1 Clustering algorithms 34
2.3.2 Hierarchical clustering algorithms 35
2.3.3 K-Means clustering algorithm 36
2.3.4 The mixture model with expectation-maximization algorithm 37
2.3.5 Fuzzy clustering algorithm 39
2.4 Gene regulatory sequence prediction 42
2.5 Datasets 45
2.5.1 Artificial datasets for cluster analysis 45
2.5.2 Artificial datasets for sequence analysis 46
2.5.3 Real clustering datasets 46
2.5.4 Gene expression datasets 48
2.5.5 CMAP datasets 49
2.5.6 Real biological sequence datasets 49
3. Fuzzy cluster analysis using FCM 53
3.1 FCM algorithm 53
3.2 FCM convergence 57
3.3 Distance measures 58
3.4 Partially missing data analysis 64
3.5 FCM Clustering solution validation 68
3.5.1 Global validity measures 70
3.5.2 Local validity measures 80
3.6 FCM partition matrix initialization 85
3.7 Determination of the number of clusters 87
3.7.1 Between-cluster partition comparison 88
3.7.2 Within-cluster partition comparison 89
3.8 Determination of the fuzzifier factor 90
3.9 Defuzzification of fuzzy partition 92
4. Methods 98
4.1 Microarray meta-analysis 98
4.1.1 Selection of samples and features 98
4.1.2 Sample and feature selection method 98
4.1.3 Methods for feature metric 108
4.2 Fuzzy cluster analysis methods using Fuzzy C-Means (FCM) 109
4.2.1 Modification of the objective function 110
4.2.2 Partition matrix initialization method 111
4.2.3 Fuzzy clustering evaluation method 121
4.2.4 Fuzzy clustering evaluation using Gene Ontology [180] 129
4.2.5 Imputation methods for partially missing data 137
4.2.6 Probability-based imputation method (fzPBI) [154] 139
4.2.7 Density-based imputation method (fzDBI) [155] 149
4.2.8 Probability-based defuzzification (fzPBD) [156] 153
4.2.9 Fuzzy genetic subtractive clustering method (fzGASCE) [157] 161
4.3 Motif finding problem 172
4.3.1 HIGEDA algorithm 172
4.3.2 New motif discovery using HIGEDA 192
5. Applications 196
5.1 Recovery of a gene expression signature 196
5.2 Gene-drug association prediction 199
5.2.1 Model design 199
5.2.2 Model testing 205
5.3 Drug target prediction 216
5.4 Application of HIGEDA to the prediction of regulatory motifs 217
6. Conclusions and future work 218
References 221
LIST OF TABLES
Table
3.1 : Performance of different distance measures on the Iris dataset 63
3.2 : Performance of different distance measures on the Wine dataset 63
3.3 : Average results of 50 trials using incomplete Iris data [140] 68
3.4 : Performance of three standard algorithms on ASET1 86
3.5 : Algorithm performance on ASET4 using MISC measure 96
3.6 : Algorithm performance on ASET5 using MISC measure 96
4.1 : Predicted HDAC antagonists 102
4.2 : Predicted HDAC agonists 103
4.3 : Predicted estrogen receptor antagonists 105
4.4 : Predicted estrogen receptor agonists 107
4.5 : Algorithm performance on the ASET1 dataset 118
4.6 : fzSC correctness in determining cluster number on artificial datasets 119
4.7 : fzSC performance on real datasets 120
4.8 : Fraction of correct cluster predictions on artificial datasets 125
4.9 : Validation method performance on the Iris dataset (3 true clusters) 125
4.10 : Validation method performance on the Wine dataset (3 true clusters) 126
4.11 : Validation method performance on the Glass dataset (6 true clusters) 126
4.12 : Validation method performance on the Yeast dataset (5 true clusters) 127
4.13 : Validation method performance on the YeastMIPS dataset (4 true
clusters) 127
4.14 : Validation method performance on the RCNS dataset (6 true clusters) 128
4.15 : Degrees of belief of GO annotation evidences 132
4.16 : Gene Ontology evidence codes 133
4.17 : Validation method performance on the Yeast dataset using GOBP 135
4.18 : Validation method performance on the YeastMIPS dataset using GOBP.... 136
4.19 : Validation method performance on the RCNS dataset using GOCC 136
4.20 : Validation method performance on the RCNS dataset using GOBP 137
4.21 : Average results of 50 trials using an incomplete IRIS dataset with
different percentages (%) of missing value 145
4.22 : ASET4 Compactness measure 157
4.23 : IRIS: Compactness measure 159
4.24 : WINE Compactness measure 160
4.25 : Performance of GA algorithms on ASET2 168
4.26 : Performance of GA algorithms on ASET3 169
4.27 : Performance of GA algorithms on ASET4 169
4.28 : Performance of GA algorithms on ASET5 170
4.29 : Performance of GA algorithms on IRIS 171
4.30 : Performance of GA algorithms on WINE 171
4.31 : Average performance (LPC/SPC) on simulated DNA datasets 187
4.32 : Average performance (LPC/SPC/run time (seconds)) on eight DNA
datasets (# of sequences/length of motif/# of motif occurrences) 188
4.33 : Detection of protein motifs (1-8, PFAM; 9-12, Prosite) 190
4.34 : Motifs of ZZ, Myb and SWIRM domains by the four algorithms 193
4.35 : Motifs of Presenilin-1 and Signal peptide peptidase by HIGEDA 194
5.1 : Clustering results with estrogen significantly associated drugs 197
5.2 : Estrogen signature and the 'signature' predicted by fzGASCE 198
5.3 : Hsa21 set1 query genes 206
5.4 : Expanded set of the Hsa21 set1 207
5.5 : Gene clusters of the Hsa21 gene set1 expanded set 207
5.6 : Proposed gene expression signature for the Hsa21 gene set1 209
5.7 : Drugs predicted to enhance expression of Hsa21 gene set1 210
5.8 : Drugs predicted to repress expression of Hsa21 gene set1 211
5.9 : Hsa21 set2 query genes 212
5.10 : Predicted gene expression signature for the Hsa21 gene set2 213
5.11 : Drugs predicted to enhance expression of Hsa21 gene set2 215
5.12 : Drugs predicted to repress expression of Hsa21 gene set2 215
LIST OF FIGURES
Figure
1-1 : Microarray data analysis and the contributions of this research 12
2-1 : Microarray design and screening 15
2-2 : Microarray analysis 18
3-1 : Expression levels of three genes in five experimental conditions 60
3-2 : ASET2 dataset with five well-separated clusters 76
3-3 : PC index (maximize) 77
3-4 : PE index (minimize) 77
3-5 : FS index (minimize) 77
3-6 : XB index (minimize) 77
3-7 : CWB index (minimize) 78
3-8 : PBMF index (maximize) 78
3-9 : BR index (minimize) 78
3-10 : Performance of the global validity measures on artificial datasets with
different numbers of clusters 79
3-11 : Dendrogram of the Iris dataset from a 12-partition generated by FCM 83
3-12 : Dendrogram of the Wine dataset from a 13-partition generated by FCM 84
3-13 : ASET1 dataset with 6 clusters 85
3-14 : PC index on the ASET2 dataset (maximize) 88
3-15 (Wu [144]) : Impact of m on the misclassification rate in the Iris dataset 91
3-16 : Artificial dataset 5 (ASET5) with three clusters of different sizes 93
4-1 : Venn diagram for predicted HDAC antagonists 104
4-2 : Venn diagram for HDAC inhibitors 104
4-3 : Venn diagram for predicted estrogen receptor antagonists 106
4-4 : Venn diagram for predicted estrogen receptor agonists 106
4-5 : Candidate cluster centers in the ASET1 dataset found using fzSC. Squares,
cluster centers by FCM; dark circles, cluster centers found by fzSC.
Classes are labeled 1 to 6 118
4-6 : Average RMSE of 50 trials using an incomplete ASET2 dataset with
different percentages of missing values 142
4-7 : Average RMSE of 50 trials using an incomplete ASET5 dataset with
different percentages of missing values 143
4-8 : Average RMSE of 50 trials using an incomplete Iris dataset with different
percentages of missing values 144
4-9 : Average RMSE of 50 trials using an incomplete Wine dataset with
different percentages of missing values 145
4-10 : Average RMSE of 50 trials using an incomplete RCNS dataset with
different percentages of missing values 146
4-11 : Average RMSE of 50 trials using an incomplete Yeast dataset with
different percentages of missing values 147
4-12 : Average RMSE of 50 trials using an incomplete YeastMIPS dataset with
different percentages of missing values 148
4-13 : Average RMSE of 50 trials using an incomplete Serum dataset with
different percentages of missing values 148
4-14 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET2 dataset
with different missing value percentages 151
4-15 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET5 dataset
with different missing value percentages 151
4-16 : (fzDBI) Average RMSE of 50 trials using an incomplete Iris dataset with
different percentages of missing values 152
4-17 : (fzDBI) Average RMSE of 50 trials using an incomplete YeastMIPS
dataset with different percentages of missing values 152
4-18 : Algorithm performance on ASET2 156
4-19 : Algorithm performance on ASET3 156
4-20 : Algorithm performance on ASET4 157
4-21 : Algorithm performance on ASET5 158
4-22 : Algorithm performance on IRIS 158
4-23 : Algorithm performance on WINE dataset 159
4-24 : Algorithm performance on GLASS dataset 160
4-25 : A motif model Θ 175
4-26 : Dynamic alignment of s='ACG' w.r.t. the motif model from Figure 4-25 176
4-27 : v(t) = Vm × T/(T + t²). Vm is the maximum value of v 184
4-28 : Strep-H-triad motif by HIGEDA 189
5-1 : Gene expression signature prediction 196
5-2 : Gene-Drug association prediction 199
5-3 : Expansion of gene set A 200
5-4 : Identification of key genes 202
5-5 : Gene expression signature generation and application 204
5-6 : Transcription factor and microRNA binding site prediction using HIGEDA...217
1. Introduction
High-throughput microarray technology for determining global changes in gene
expression is an important and revolutionary experimental paradigm that facilitates
advances in functional genomics and systems biology. This technology allows
measurement of the expression levels of thousands of genes with a single microarray.
Widespread use of the technology is evident in the rapid growth of microarray datasets
stored in public repositories. For example, since its inception in 2000, the
Gene Expression Omnibus (GEO), maintained by the National Center for
Biotechnology Information (NCBI), has received thousands of data submissions
representing more than 3 billion individual molecular abundance measurements [6,7].
1.1 Gene expression data analysis challenges
The growth in microarray data deposition is reminiscent of the early days of
GenBank, when exponential increases in publicly accessible biological sequence data
drove the development of analytical techniques required for data analysis. However,
unlike biological sequences, microarray datasets are not easily shared by the research
community, resulting in many investigators being unable to exploit the full potential of
these data. In spite of over 15 years of experience with microarray experiments, new
paradigms for integrating and mining publicly available microarray results are still
needed to promote widespread, investigator-driven research on shared data. In general,
integrating multiple datasets is expected to yield more reliable and more valid results
due to a larger number of samples and a reduction of study-specific biases.
Ideally, gene expression data obtained in any laboratory, at any time, using any
microarray technology should be comparable. However, this is not true in reality.
Several issues arise when attempting to integrate microarray data generated using
different array technologies. Meta-analysis and integrative data analysis only exploit
the behavior of genes individually at the genotype level. In order to discover gene
behaviors at the phenotype level by connecting genotype to phenotype, additional
analyses are required. Cluster analysis, an unsupervised learning method, is a popular
method that searches for patterns of gene expression associated with an experimental
condition. Each experimental condition usually corresponds to a specific biological
event, such as a drug treatment or a disease state, thereby allowing discovery of the
drug targets, and drug and disease relationships.
Numerous algorithms have been developed to address the problem of data
clustering. These algorithms, notwithstanding their different approaches, all deal with
the issue of determining the number and the construction of clusters. Although some cluster
indices address this problem, they still have the drawback of model overfitting.
Alternative approaches, based on statistics with the loglikelihood estimator and a
model parameter penalty mechanism, can eliminate overfitting, but are still limited by
assumptions regarding models of data distribution and by a slow convergence with
model parameter estimation. Even when the number of clusters is known a priori,
different clustering algorithms may provide different solutions, because of their
dependence on the initialization parameters. Because most algorithms use an iterative
process to estimate the model parameters while searching for optimal solutions, a
solution that is a global optimum is not guaranteed [122,132].
1.2 Motivations and proposed methods
This dissertation focuses on construction of a machine learning and data mining
framework for discovery of biological meaning in gene expression data collected from
multiple sources. Three tools have been developed: (i) a novel meta-analysis method for
identifying expression profiles of interest; (ii) a set of novel clustering methods to
incorporate prior biological knowledge and to predict co-regulated genes; and (iii) a
novel motif-finding method to extract information on shared regulatory sequences.
Together, these methods provide a comprehensive analysis of gene expression data.
1.2.1 Microarray meta-analysis
Several methods have been proposed to combine inter-study microarray data. These
methods fall into two major categories based on the level at which data integration is
performed: meta-analysis and direct integration of expression values. Meta-analysis
integrates data from different studies after they have been pre-analyzed individually. In
contrast, the latter approach integrates microarray data from different studies at the level
of expression, after transforming expression values to numerically comparable measures
and/or normalization, and carries out the analysis on the combined dataset.
Meta-analysis methods combine the results of analysis from individual studies in
order to increase the ability to identify significantly expressed genes or altered samples
across studies that used different microarray technologies (i.e., different platforms or
different generations of the same platform). Lamb et al. [25] and Zhang et al. [26]
proposed rank-based methods for performing meta-analysis. They started with the
Connectivity Map (CMAP) datasets, a collection of one-off microarray screenings of
>1000 small-molecule treatments of five different human cell lines. Genes in each array
were ranked using statistical methods and the array data were then integrated, based on
how they fit a given gene expression signature, defined as a set of genes responding to
the biological context of interest. Gene expression signatures could then be used to
query array data generated using other array technologies. These methods successfully
identified a number of small molecules with rational and informative associations with
known biological processes. However, Lamb et al. [25] weighted up-regulated genes
and down-regulated genes differently, while Zhang et al. [26] showed that genes with
the same amount of expression change should be weighted equally. Both Zhang et al.
[26] and Lamb et al. [25] used fold change criteria (where "fold change", FC, is the
ratio between the gene expression value of treated and untreated conditions) to rank the
genes in expression profiles, and they solved the problem of platform-specific probes by
eliminating them. Because of the experimental design and the noise inherent in the
screening process and imaging methods, a single gene may have probesets that differ in
the direction of expression change, e.g., some probesets may be upregulated and others
downregulated. This problem was addressed by Lamb et al. [25] and Zhang et al. [26]
by including all probesets in the ranked list and averaging probesets across genes. In
fact, it is not biologically necessary that a single differentially expressed gene show the
identical FC for all probesets. It is not straightforward to compare differentially
expressed genes just by averaging their probeset values. Witten et al. [9] showed that
FC-based approaches, namely FC-difference and FC-ratio, are linear-comparison
methods and are not appropriate for noisy data or data from different experiments.
In addition, these methods are unable to measure the statistical significance of the
results. While approaches using the t-test were proposed, Deng [10] showed that the
Rank Product (RP) method is not only more stable than the t-test, but also provides
the statistical significance of the gene expression differences. Breitling et al. [29]
proposed an RP-based approach that allows the analysis of gene expression profiles
from different experiments and offers several advantages over the linear approaches,
including the biologically intuitive FC criterion, while using fewer model assumptions.
However, both the RP and t-test methods, as well as other popular meta-analysis
methods, have problems with small datasets and datasets with a small number of
replicates. Hong et al. [30] proposed an RP-based method that is appropriate not
only for small sample sizes, but also improves performance on noisy data. Their
method has achieved widespread acceptance and has been used in such diverse fields as
RNAi analysis, proteomics, and machine learning [8]. However, most RP-based
methods, unlike t-test-based methods, do not provide an overall measure of differential
expression for every gene within a given study [54].
We develop a meta-analysis method that uses RP first to determine the differentially
expressed genes in each study, and then to construct ordered lists of up- and down-
regulated genes in each study based on p-values determined using a permutation-based
method. We then use the non-parametric rank-based pattern-matching strategy
based on the Kolmogorov-Smirnov statistic [25] to filter the profiles of interest. We
again use RP to filter genes that are differentially expressed across multiple studies.
In our method, a t-test-based approach can also be used in the first step to generate an
estimated FC for each gene within each study. These values are then used in the cluster
analysis, instead of the average FC values, to group the genes based on the similarity
of their expression change patterns across multiple studies.
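The Rank Product idea described above (rank each gene within each replicate, combine the ranks by their geometric mean, and assess significance by permutation) can be sketched as follows. This is an illustrative implementation, not the code used in this work:

```python
import numpy as np

def rank_product(fold_changes):
    """Rank Product per gene: geometric mean of its ranks across replicates.

    fold_changes: (genes x replicates) array; within each replicate, genes
    are ranked so that rank 1 is the most up-regulated (largest fold change).
    """
    n_genes, k = fold_changes.shape
    ranks = np.empty_like(fold_changes, dtype=float)
    for j in range(k):
        order = np.argsort(-fold_changes[:, j])     # descending fold change
        ranks[order, j] = np.arange(1, n_genes + 1)
    return ranks.prod(axis=1) ** (1.0 / k)

def rp_pvalues(fold_changes, n_perm=1000, seed=0):
    """Permutation p-value per gene: the fraction of RP values obtained under
    random rank assignments that are at least as small as the observed RP."""
    rng = np.random.default_rng(seed)
    observed = rank_product(fold_changes)
    n_genes, k = fold_changes.shape
    null = []
    for _ in range(n_perm):
        # independent random rankings for each replicate
        perm = np.column_stack([rng.permutation(n_genes) + 1 for _ in range(k)])
        null.append(perm.prod(axis=1) ** (1.0 / k))
    null = np.concatenate(null)
    return np.array([(null <= rp).mean() for rp in observed])
```

A gene consistently ranked first in every replicate gets RP = 1 and a small p-value; genes with inconsistent ranks drift toward larger RP values, which is why the statistic is robust to the noise that defeats linear FC comparisons.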
1.2.2 Cluster analysis of gene expression data
Numerous clustering algorithms are currently in use. Hierarchical clustering results
in a tree structure, where genes on the same branch at the desired level are considered to
be in the same cluster. While this structure provides a rich visualization of the gene
clusters, it has the limitation that a gene can be assigned only to one branch, i.e., one
cluster. This is not always biologically reasonable because most, if not all, genes have
multiple functions. In addition, because of the mechanism of one-way assignment of
genes to branches, the results may not be globally optimal. The alternative approach,
partitioning clustering, includes two major methods, heuristic-based and model-based.
The former assigns objects to clusters using a heuristic mechanism, while the latter uses
quantified uncertainty measures. Probability and possibility are the two uncertainty
measures commonly used. While probabilistic bodies of evidence consist of singletons,
possibilistic bodies of evidence are families of nested sets. Both probability and
possibility measures are uniquely represented by distribution functions, but their
normalization requirements are very different. Values of each probability distribution
are required to add to 1, while for possibility distributions, the largest values are
required to be 1. Moreover, the latter requirement may even be abandoned when
possibility theory is formulated in terms of fuzzy sets [172-173, 179]. The mixture
model with the expectation-maximization (EM) algorithm is a well-known method
using the probability-based approach. This method has the advantages of a strong
statistical basis and a statistics-based model selection. However, the EM algorithm
converges slowly, particularly in regions where clusters overlap [27], and requires the
data to follow a specific distribution model. Because gene expression
data are likely to contain overlapping clusters and do not always follow standard
distributions (e.g., Gaussian, Chi-squared) [28], the mixture model with the EM
method is not appropriate.
Fuzzy clustering using the most popular algorithm, Fuzzy C-Means (FCM) [92,
121], is another model-based clustering approach that uses the possibility measure.
FCM both converges rapidly and allows assignment of objects to overlapping clusters
using the fuzzifier factor, m, where 1 < m < ∞. When the value of m changes, the cluster
overlaps change accordingly, and there are no overlapping regions in the
clustering results when the value of m is equal to 1. However, similar to the EM-based
method and the other partitioning methods, FCM has the problem of determining the
correct number of clusters. Even if the cluster number is known a priori, FCM may
provide different cluster partitions. Cluster validation methods are required to determine
the optimal cluster solution. Unfortunately, the clustering model of FCM is a
possibility-based one. There is no straightforward statistics-based approach to evaluate
such a clustering model, apart from the cluster validity indices based on the compactness
and separation factors of the fuzzy partitions. However, these validity indices have
problems both with scaling the two factors and with overfit estimates. We have
combined the advantages of the EM and FCM methods, where FCM plays the key role
in clustering the data, and proposed a method to convert the clustering possibility model
into a probability one and then to use the Central Limit Theorem to compute the
clustering statistics and determine the data distribution model that best describes the
dataset. We applied the Bayesian method with the log-likelihood ratio and the Akaike
Information Criterion measure, using the estimated distribution model, in a novel
validation method for fuzzy clustering partitions [147].
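The FCM iteration discussed above alternates between updating cluster centers and fuzzy memberships under the fuzzifier m. The following is a minimal sketch, assuming Euclidean distance and a random initial partition; it illustrates the algorithm generically and is not the thesis implementation:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal Fuzzy C-Means: alternate center and membership updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)        # fuzzy partition: rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        # cluster centers are membership-weighted means of the data
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                # guard against zero distances
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers
```

With m close to 1 the memberships approach hard (crisp) assignments; larger m lets clusters overlap, which is the property that makes FCM attractive for multi-functional genes.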
1.2.3 DNA sequence analysis
Genes within a cluster are expressed similarly under the given experimental
conditions, e.g., a drug treatment, a comparison between the normal and disease tissues,
or other biological comparisons of interest. Genes that are similarly expressed may have
regulatory DNA sequences in common and appropriate analysis may identify motifs
that are overrepresented in these gene sequences. Such motifs may contribute to
expression of the entire group [31]. In addition, for clusters of down-regulated
genes, we can apply the sequence analysis to the 3' untranslated regions (3' UTRs),
potentially discovering motifs connected to miRNAs [32]. We may thereby discover
relationships between the miRNAs and the diseases or treatment conditions targeted
by the microarray experiments.
Most algorithms developed to find motifs in biological sequences do not
successfully identify motifs containing gaps and do not allow a variable number of
motif instances in different sequences. In addition, they may converge to local optima
[33] and therefore do not guarantee that the motifs are indeed overrepresented in the
sequence set. The MEME algorithm [34] solved many of these problems, but still does
not model gapped motifs well. By combining the EM algorithm with a hierarchical
genetic algorithm, we developed a novel motif discovery algorithm, HIGEDA [145],
that automatically determines the motif consensus using a Position Weight Matrix
(PWM). By using a dynamic programming (DP) algorithm, HIGEDA also identifies
gapped motifs.
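To illustrate the PWM representation mentioned above: a PWM stores, for each motif position, the probability of each base, and a candidate site is scored by its log-odds against a background model. The sketch below is generic (uniform background, ungapped motifs), not HIGEDA itself:

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Column-wise base probabilities from aligned motif occurrences."""
    w = len(sites[0])
    pwm = []
    for i in range(w):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def score(pwm, window, background=0.25):
    """Log-odds score of a sequence window against a uniform background."""
    return sum(math.log2(col[b] / background) for col, b in zip(pwm, window))

def best_site(pwm, seq):
    """Slide the PWM along seq; return (best_score, position)."""
    w = len(pwm)
    return max((score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))
```

For example, a PWM built from the aligned sites 'TATA', 'TATA', 'TATA' gives its highest score to the window starting at position 3 of 'GGGTATAGG'. EM-style motif finders refine such a matrix iteratively from unaligned sequences rather than from a known alignment.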
1.3 Research contributions
The rapid growth of microarray repositories has increased the need for effective
methods for integration of datasets among laboratories using different microarray
technologies. The meta-analysis approach has the advantage of integrating data without
the problem of scaling expression values, and also has multiple supporting methods to
detect and order significantly differentially expressed genes. However, parametric
statistics methods may perform differently on different datasets, because of different
assumptions in the method models. We propose using the RP method [8] to filter the
genes for differential expression. We also propose to integrate RP with a pattern-
matching ranking method, in a method named RPPR [169], a combination of the Rank
Product and PageRank methods, to more effectively filter gene expression profiles
from multiple studies.
Meta-analysis of gene expression data does not put differentially expressed genes
together at the phenotype level of the experiments. Therefore, the real power of
microarrays is not completely exploited. Cluster analysis can discover relationships
among genes, between genes and biological conditions, and between genes and
biological processes. Most clustering algorithms are not appropriate for gene expression
data analysis because they do not allow overlapping clusters and require the data to
follow a specific distribution model. We developed a new clustering algorithm,
fzGASCE [157] that combines the wellknown optimization algorithm, Genetic
Algorithm (GA), with FCM and fuzzy subtractive clustering to effectively cluster gene
expression data without a priori specification of the number of clusters. Regarding the
parameter initialization problem of the FCM algorithm, we developed a novel Fuzzy
Subtractive Clustering (fzSC) algorithm [146] that uses a fuzzy partition of data instead
of the data themselves. fzSC has advantages over Subtractive Clustering in that it does
not require a priori specification of mountain peak and mountain radii, or the criterion
on how to determine the number of clusters. In addition, the computational time of fzSC
is O(cn) instead of O(n²), where c and n are the numbers of clusters and data objects,
respectively. To address the problem of missing data in cluster analysis, we developed
two imputation methods, fzDBI and fzPBI, which use histogram-based and probability-
based approaches to model the data distribution and apply the model to impute missing
data. For the problem of cluster model selection and clustering result evaluation with
the FCM algorithm, we developed a novel method, fzBLE [147], that uses the
likelihood function, a statistical goodness-of-fit measure, and a Bayesian method with
the Central Limit Theorem to effectively describe the data distribution and correctly select the
optimal fuzzy partition. Using the statistical model of fzBLE, we developed a
probability-based method for the defuzzification of fuzzy partitions, which generates
classification information for data objects from a fuzzy partition.
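The model selection step above, choosing the partition whose improved fit justifies its extra complexity, can be illustrated with the Akaike Information Criterion mentioned earlier. This is a generic sketch with invented numbers, not a result from this work:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical (log-likelihood, free-parameter count) pairs for candidate
# cluster counts; the values are invented for illustration only.
candidates = {2: (-420.5, 9), 3: (-401.2, 14), 4: (-399.8, 19)}
best_c = min(candidates, key=lambda c: aic(*candidates[c]))
# Here the 3-cluster model wins: adding a fourth cluster barely improves the
# likelihood, so the parameter penalty outweighs the gain.
```

The penalty term is what prevents the overfitting that plain compactness-based validity indices suffer from: a partition with more clusters must earn its parameters through a genuinely better fit.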
In addition to clustering of gene expression data, we also propose an analysis
directed at the biological relevance of the cluster results. Using our RPPR method, we
provide relationships between the gene clusters and the biological conditions of interest,
and determine the clusters which respond positively and negatively to the experimental
conditions. Furthermore, our motif finding algorithm, HIGEDA [145], can discover
common motifs in the promoter and mRNA sequences of genes in a cluster. In addition
to predicting the possible transcription factor binding sites, HIGEDA can also be used
to predict potential miRNA binding sites.
Figure 1-1 shows the key issues in microarray data analysis that are addressed in this research. The remainder of the dissertation is organized as follows: Chapter 2 presents the background of microarray analysis, including microarray technology, microarray data analysis and recent approaches, and the goals of this research; Chapter 3 describes a rigorous analysis of the Fuzzy C-Means algorithm; Chapter 4 describes our methods to address the challenges of gene expression microarray data analysis; and Chapter 5 demonstrates some specific applications of our approach, summarizes our contributions, and discusses our future work.
Figure 1-1: Microarray data analysis and the contributions of this research (workflow: microarrays from multiple labs; gene expression profiles of interest; determination of the optimal fuzzy partition (fzSC); cluster analysis using fzGASCE with missing data imputation using fzDBI and fzPBI; cluster validation (fzBLE); defuzzification (fzPBD); gene groups; group selection (RPPR); motif discovery (HIGEDA); biological context relevance; miRNA prediction)
2. Microarray analysis background
2.1 Microarray
DNA microarrays rely on the specificity of hybridization between complementary
nucleic acid sequences in DNA fragments (termed probes) immobilized on a solid
surface and labeled RNA fragments isolated from biological samples of interest [6]. A
typical DNA microarray consists of thousands of ordered sets of DNA fragments on a
glass, filter, or silicon wafer. After hybridization, the signal intensity of each individual
probe should correlate with the abundance of the labeled RNA complementary to that
probe [1].
2.1.1 Microarray platforms
DNA microarrays fall into two types based on the DNA fragments used to build
the array: complementary DNA (cDNA) arrays and oligonucleotide arrays. Although a
number of subtypes exist for each array type, spotted cDNA arrays and Affymetrix
oligonucleotide arrays are the major platforms currently in use. The choice of which
microarray platform to use is based on the research needs, cost, available expertise, and
accessibility [1].
For cDNA arrays, cDNA probes, which are usually generated by a polymerase
chain reaction (PCR) amplification of cDNA clone inserts (representing genes of
interest), are robotically spotted on glass slides or filters. The immobilized sequences of
cDNA probes may range greatly in length, but are usually much longer than those of the
corresponding oligonucleotide probes. The major advantage of cDNA arrays is the
flexibility in designing a custom array for specific purposes. Numerous genes can be
rapidly screened, which allows very quick elaboration of functional hypotheses without
any a priori assumptions [120]. In addition, cDNA arrays typically cost approximately one-fourth as much as commercial oligonucleotide arrays. Flexibility and lower cost
initially made cDNA arrays popular in academic research laboratories. However, the
major disadvantage of these arrays is the amount of total input RNA needed. It is also
difficult to have complete control over the design of the probe sequences. cDNA is generated by the enzyme reverse transcriptase (an RNA-dependent DNA polymerase), which, like all DNA polymerases, cannot initiate synthesis de novo but requires a primer. It
is therefore difficult to generate comprehensive coverage of all genes in a cell.
Furthermore, managing large clone libraries, and the infrastructure of a relational
database for keeping records, sequence verification and data extraction is a challenge
for most laboratories.
For oligonucleotide arrays, probes are composed of short segments of DNA complementary to the RNA transcripts of interest and are synthesized directly on the
surface of a silicon wafer. When compared with cDNA arrays, oligonucleotide arrays
generally provide greater gene coverage, more consistency, and better quality control of
the immobilized sequences. Other advantages include uniformity of probe length, the
ability to discriminate gene splice variants, and the availability of carefully designed
standard operating procedures. Another advantage particular to Affymetrix arrays is the
ability to recover samples after hybridization to an array. This feature makes Affymetrix
arrays attractive in situations where the amount of available tissue is limited. However,
a major disadvantage is the high cost of arrays [1].
Following hybridization, the image is processed to obtain the hybridization
signals. There are two different ways to measure signal intensity. In the two-color fluorescence hybridization scheme, the RNAs from experimental and control samples (referred to as target RNAs) are differentially labeled with fluorescent dyes (Cy5, red, vs. Cy3, green) and hybridized to the same array. When the region of the probe is fluorescently illuminated, both the experimental and control target RNAs fluoresce, and the relative balance of red versus green fluorescence indicates the relative expression levels of the experimental and control target RNAs. Therefore, gene expression values are reported as ratios between the two fluorescent values [1].
Figure 2-1: Microarray design and screening (workflow: RNA extraction, sample preparation and labelling, hybridization, washing, image acquisition, image processing, and feature-level data such as .cel and .grp files)
Affymetrix oligonucleotide arrays use a one-color fluorescence hybridization
system where experimental RNA is labeled with a single fluorescent dye and hybridized
to an oligonucleotide array. After hybridization, the fluorescence intensity from each
spot on the array provides a measurement of the abundance of the corresponding target
RNA. A second array is then hybridized to the control RNA, allowing calculation of
expression differences. Because Affymetrix array screening generally follows a
standard protocol, results from different experiments in different laboratories can
theoretically be combined [6].
Following image processing, the digitized gene expression data need to be preprocessed and normalized before further analysis.
2.1.2 Gene expression microarray data analysis
To identify differentially expressed genes, many protocols use a two-fold difference as the cutoff criterion. However, this arbitrary cutoff may be either too high or too low depending on the data variability, which such a rule does not take into account: a data point above or below the cutoff line could be there by chance or by error. Ensuring that a gene is truly differentially expressed requires multiple replicate experiments and statistical testing [8]. However, not all microarray experiments include replicate arrays. We therefore need to analyze data generated by different protocols.
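The point about replicates and statistical testing can be illustrated with a simple two-sample statistic. The sketch below is a generic Welch t-statistic; the function name and data are illustrative, not taken from any protocol described here.

```python
import math
import statistics

def welch_t(treat, ctrl):
    """Welch two-sample t statistic for one gene's replicate values.

    Unlike a raw two-fold cutoff, this accounts for within-group
    variability; a large |t| suggests a real expression difference.
    """
    m1, m2 = statistics.mean(treat), statistics.mean(ctrl)
    v1, v2 = statistics.variance(treat), statistics.variance(ctrl)
    return (m1 - m2) / math.sqrt(v1 / len(treat) + v2 / len(ctrl))

# Replicated measurements for one gene under treatment vs. control
t_stat = welch_t([2.0, 2.1, 1.9], [1.0, 1.1, 0.9])
```

With low within-group variability, even a modest fold change yields a large t statistic, which is why replication matters more than the fold-change magnitude alone.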
Statistical analysis, including meta-analysis, uses microarrays to study genes in isolation, while the real power of microarrays is their ability to reveal the relationships between genes and to identify genes or samples that behave in a similar manner.
Machine learning and data mining approaches have been developed for further analysis
procedures. These approaches can be divided into unsupervised and supervised
methods. Unsupervised methods involve the aggregation of samples, genes, or both into
different clusters based on the distance between measured gene expression values. The
goal of clustering is to group objects with similar properties, leading to clusters where
the distance measure is small within clusters and large between clusters. Several
clustering methods from classical pattern recognition, such as hierarchical clustering, K-Means clustering, fuzzy C-Means clustering, and self-organizing maps, have been applied to microarray data analysis. Using unsupervised methods, we can search the resulting clusters for candidate genes or treatments whose expression patterns could be associated with a given biological condition or gene expression signature. The advantage of this approach is that it is unbiased and allows identification of significant structure in a complex dataset without any prior information about the objects.
In contrast, supervised methods integrate the knowledge of sample class
information into the analysis with the goal of identifying expression patterns (i.e., gene
expression signatures) which could be used to classify unknown samples according to
their biological characteristics. A training dataset, consisting of gene expression values
and sample class labels, is used to select a subset of expressed genes that have the most
discriminative power between the classes. This subset is then used to build a predictive model, also called a classifier (e.g., k-nearest neighbors, neural networks, support vector machines), which takes as input the gene expression values of the preselected set of genes from an unknown sample and outputs the predicted class label of the sample.
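A minimal sketch of such a classifier, using k-nearest neighbors on toy two-gene expression vectors; all names and data here are illustrative, not from any dataset discussed in this work.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority-vote k-nearest-neighbour classification.

    train: list of (expression_vector, class_label) pairs; the query
    sample receives the most common label among its k closest profiles.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 2-gene expression vectors labelled by sample class
training = [([0, 0], "control"), ([0, 1], "control"),
            ([5, 5], "tumor"), ([5, 6], "tumor"), ([6, 5], "tumor")]
label = knn_predict(training, [5, 5], k=3)
```

In practice the "training" vectors would be the expression values of the preselected discriminative genes, as described above.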
Figure 2-2: Microarray analysis (workflow from biological question through experimental design, experiment, quantification and preprocessing, normalization, estimation, and testing, to clustering, classification, and regulatory prediction; statistical analysis (meta-analysis, DEG), data mining (clustering, classification), and sequence analysis span the application-specific and technology-specific layers)
2.2 Microarray data meta-analysis
2.2.1 Microarray data integration
The integrated analysis of data from multiple studies generally promises to
increase statistical power, generalizability, and reliability, while decreasing the cost of
analysis, because it is performed using a larger number of samples and the effects of individual study-specific biases are reduced. There are two common approaches to integrating microarray data: meta-analysis and direct integration of expression values.
The direct integration procedure [11-13, 17] can be divided into the following steps. First, a list of genes common to multiple distinct microarray platforms is extracted by cross-referencing the annotation of each probe set represented on the microarrays. Cross-referencing of expression data is usually achieved using the UniGene or LocusLink/EntrezGene databases or the best-match mapping provided by Affymetrix. Next, for each individual dataset, numerically comparable quantities are
derived from the expression values of genes in the common list by application of
specific data transformation and normalization methods. Finally, the newly derived
quantities from individual datasets are combined. Direct integration methods include the
following: Ramaswamy et al. [14] rescaled expression values of a common set of genes
for each of five cancer microarray datasets generated by independent laboratories using
different microarray platforms. Combining them to form a dataset with increased
sample size allowed identification of a gene expression signature that distinguished
primary from metastatic cancers. Bloom et al. [15] used a scaling approach based on
measurements from one common control sample to integrate microarray data from
different platforms. Shen et al. [16] proposed a Bayesian mixture model to transform
each raw expression value into a probability of differential expression (POD) for each
gene in each independent array. Integrating multiple studies on the common probability scale of POD, they developed a 90-gene meta-signature that predicted relapse-free survival in breast cancer patients with improved statistical power and reliability. In
addition to common data transformation and normalization procedures, Jiang et al. [17]
proposed a distribution transformation method to transform multiple datasets into a
similar distribution before data integration.
Data processed by distribution transformation showed a greatly improved
consistency in gene expression patterns between multiple datasets. Warnat et al. [18]
used two data integration methods, median rank scores and quartile discretization, to
derive numerically comparable measures of gene expression from independent datasets.
These transformed data were then integrated and used to build support vector machine
classifiers for cancer classification. Their results showed that cancer classification based
on microarray data could be greatly improved by integrating multiple datasets with a
similar focus. The classifiers built from integrated data showed high predictive power
and improved generalization performance.
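The rank-based idea behind measures such as median rank scores can be sketched as follows; this is a simplified single-dataset illustration (ties broken by position), not the exact method of Warnat et al.

```python
def to_ranks(values):
    """Replace one dataset's expression values by their ranks
    (1 = smallest). Rank-based measures are numerically comparable
    across platforms, which is what makes rank-style integration work.
    Ties are broken by position in this simplified sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# The same three genes measured on two platforms with different scales
platform_a = to_ranks([0.2, 5.1, 1.3])
platform_b = to_ranks([20.0, 510.0, 130.0])
```

Although the raw values differ by orders of magnitude between the two hypothetical platforms, the rank transform maps both onto the same scale.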
A major limitation of these direct integration methods is that filtering genes to generate a subset common to multiple distinct microarray platforms often excludes many thousands of genes, some of which may be significant. Moreover, data transformation and normalization methods are resource-sensitive [20]; one method may be the best fit for some datasets but not for others, and it is difficult to reach a consensus on which method is best for data transformation and normalization on given datasets.
Several studies have shown that expression measurements from cDNA and
oligonucleotide arrays may show poor correlation and may not be directly comparable
[21]. These differences may be due to variances in probe content, deposition
technologies, labeling and hybridizing protocols, as well as data extraction procedures
(e.g., background correction, normalization, and calculation of expression values). For example, cDNA microarray data are usually defined as ratios between experimental and control values and cannot be directly compared with oligonucleotide microarray data, which are defined as expression values of experimental samples.
Cross-laboratory comparisons of microarray data have also demonstrated that there are sometimes larger differences between data obtained in different laboratories using the same microarray technology than between data obtained in the same laboratory using different microarray technologies [22]. Wang et al. [116] also showed that the
agreement between two technologies within the same lab was greater than that between
two labs using the same technology; the lab effect, especially when confounded with the
RNA sample effect, usually plays a bigger role than the platform effect on data
agreement. Commercial microarrays, such as Affymetrix arrays, have produced several
generations of arrays to keep up with advances in genomic sequence analysis. The
number of known genes and the representative composition of gene sequences are
frequently updated and probe sets are modified or added, to better detect target
sequences and to represent newly discovered genes. A recent study has shown that
expression measurements within one generation of Affymetrix arrays are highly
reproducible, but that reproducibility across generations depends on the degree of
similarity of the probe sets and the levels of expression measurements [17]. Therefore,
even when using the same microarray technology, different generations of microarrays
make direct integration difficult.
Technical variability, which results from differences in sample composition and preparation, experimental protocols and parameters, RNA quality, and array quality, poses further challenges to the direct integration of microarray data from independent studies. Vert et al. [20] and Irizarry et al. [22] described the lab effect for microarray data and concluded that direct integration of expression data is not appropriate. Cahan et al. [21] and Warnat et al. [24], using two methods to identify differentially expressed genes prior to carrying out a classification analysis, showed that gene expression levels themselves could not be directly compared between different platforms. Therefore, we propose to use meta-analysis, instead of direct integration, before further computational analyses.
The meta-analysis method, in contrast to direct integration, combines the results from individual analyses. It therefore avoids the problem of scaling gene expression levels among datasets from different laboratories. A number of studies have shown that meta-analysis provides a robust list of differentially expressed genes [21-23]. With the increasing use of next-generation sequencing techniques, microarrays are no longer the only high-throughput technology for gene expression studies. The direct integration approach appears inappropriate for such repositories, because expression values can only be scaled across experiments that use the same technology. Even within the same technology, e.g., across Affymetrix oligonucleotide and cDNA microarrays, scaling expression values with this approach seems infeasible.
2.2.2 Issues with microarray data meta-analysis
Considering that meta-analysis involves the whole process, from choosing microarray data to detecting differentially expressed genes, the key issues are as follows.
Issue 1: Identify suitable microarray datasets
The first step is to determine which datasets to use, given the goal of the analysis. This is done by defining the inclusion-exclusion criteria. The most important criterion is that the collected datasets be in a standard gene expression data format, e.g., features by samples [23].
Issue 2: Annotate the individual datasets
Microarray probe design uses short, highly specific regions in genes of interest, because using the full-length gene sequences can lead to non-specific binding or noise. Different design criteria lead to the creation of different probes for the same gene. Therefore, one needs to identify which probes represent a given gene within and across
the datasets. The first option is to cluster probes based on sequence data [11, 17]. A sequence-match method is especially appropriate for cross-platform data integration, as well as for Affymetrix cross-generation data integration. However, the probe sequence may not be available for all platforms, and clustering probe sequences can be computationally intensive for very large numbers of probes.
Alternatively, one can map probe-level identifiers, such as IMAGE CloneID, Affymetrix ID, or GenBank accession numbers, to a gene-level identifier such as UniGene, RefSeq, or LocusLink/EntrezGene. UniGene, an experimental system for automatically partitioning sequences into non-redundant gene-oriented clusters, is a popular choice for unifying different datasets. For example, UniGene Build #211 (released March 12, 2008) reduces the nearly 7 million human cDNA sequences to 124,181 clusters. To translate probe-level identifiers to gene-level identifiers, one can use either the annotation packages in BioConductor, or the identifier mapping tables provided by NCBI and Affymetrix (LocusLink/Entrez ID to probe ID), or UniGene (probe ID to RefSeq).
Issue 3: Resolve the many-to-many relationships between probes and genes
The relationship between probes and genes is unfortunately not unequivocal: in some cases a probe may report more than one gene, and vice versa. Even using the same Affymetrix platform, the combination of different chip versions creates serious difficulties, because the probe identification labels (IDs) are not conserved from chip to chip. Therefore, to combine microarray data across studies, a unique nomenclature must be adopted and all the different IDs of the chips must be translated to a common system.
It is reasonable for many probe identifiers to map onto one gene identifier: this follows from the current UniGene clustering and genome annotation, multiple probes per gene provide internal replicates that tolerate poor performance of some probes without loss of gene expression detection, and microarray chips contain duplicate spotted probes. The issue arises when a probe identifier can be mapped to
many gene identifiers, which may cause problems in the subsequent meta-analysis. For example, a probe could map to gene X in half of the datasets, but to both genes X and Y in the remaining datasets. The meta-analysis will then treat such probes as two separate gene entities, failing to fully combine the information for gene X from all studies. If one simply discards such probes, valuable information may be lost to further analysis.
Issue 4: Choosing the meta-analysis technique
The decision regarding which meta-analysis technique to use depends on the specific application. In this context, we focus on a fundamental application of microarrays: the two-class comparison, e.g., treatment samples versus control samples, where the objective is to identify genes differentially expressed between the two conditions.
Let X_{N \times P} represent the expression matrix of the selected datasets from multiple studies, where N is the number of common genes in the P selected samples. Let K be the number of studies from which the samples were selected, and let P_k be the number of samples belonging to the k-th study. Let P_k^c be the number of control samples and P_k^t the number of treatment (experimental) samples from the k-th study; then P_k = P_k^c + P_k^t and P = \sum_{k=1}^{K} P_k. For Affymetrix oligonucleotide array experiments, we have P_k^c chips with gene expression measures from the control class and P_k^t chips with gene expression measures from the treatment class. For two-channel array experiments, we assume that the comparisons of log-ratios are all indirect, i.e., the samples from the treatment class are hybridized against a reference sample R. The expression values from the k-th study are then collected into X as X_j^t = \log_2(T_j / R), j = 1, \ldots, P_k^t, and X_q^c = \log_2(C_q / R), q = 1, \ldots, P_k^c. The meta-analysis is performed on the results of the analyses carried out on X. There are four common ways to integrate such information across studies.
Vote counting
This method simply counts the number of studies where a gene was declared as
significant [48]. If the number of studies is small, this method can be visualized
using Venn diagrams.
Combining ranks
Unlike vote counting, this technique accounts for the order of genes declared significant [30, 52-54]. There are three different approaches to aggregating the rankings of, say, the top-100 lists (the 100 most significantly up-regulated or down-regulated genes) from different studies [23]. Two of the algorithms use Markov chains to convert the pairwise preferences between the gene lists into a stationary distribution; the third algorithm is based on an order-statistics model.
The rank product method proposed by Breitling et al. [29], which uses the third algorithm, is the most popular of these approaches. It is a non-parametric statistic originally used to detect differentially expressed genes in a single dataset. It is derived from the biological reasoning behind the fold change (FC) criterion and detects the genes that are consistently found among the most strongly up-regulated (or down-regulated) genes in a number of replicate experiments. This method also offers a natural way to overcome the heterogeneity among multiple datasets and can therefore be extended to meta-analysis, generating a single significance measurement for each gene in the combined study [30]. Within the dataset of study k, the pairwise fold change is computed for every gene i: pFC_{kq} = T_k / C_q, k = 1, \ldots, P_k^t, q = 1, \ldots, P_k^c. There are P_k^t \times P_k^c such ratio values for each gene i; let R = P_k^t \times P_k^c. These ratios are ranked within each comparison to generate a ranking matrix rFC_{N \times R}. The rank product (RP) for gene i is computed as the geometric mean of its ranks:

RP_i = \left( \prod_{r=1}^{R} rFC_{ir} \right)^{1/R}. (2.1)

RP_i / n can be interpreted as a p-value, because it describes the probability of observing gene i at rank rFC_{ir} or better in the r-th comparison, r = 1, \ldots, R. A permutation procedure is independently carried out to generate all other possible combinations of ranks relative to gene i in the dataset, yielding nRP_i. This procedure is repeated B times, where B is a positive number, to form a reference distribution from the nRP_i across the R comparisons, which is then used to compute the (adjusted) p-value and false discovery rate (FDR) for each gene.
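The rank-product statistic described above can be sketched in a few lines; the gene names and ranks below are hypothetical, and the permutation step for significance is omitted.

```python
import math

def rank_products(ranked_lists):
    """Geometric mean of each gene's rank over R pairwise comparisons,
    i.e. the rank product RP_i = (prod_r rFC_ir)^(1/R).

    ranked_lists: R dicts mapping gene -> rank (1 = most up-regulated).
    """
    R = len(ranked_lists)
    return {g: math.exp(sum(math.log(ranks[g]) for ranks in ranked_lists) / R)
            for g in ranked_lists[0]}

# A gene ranked near the top in every comparison gets a small RP
rp = rank_products([{"gA": 1, "gB": 3, "gC": 2},
                    {"gA": 2, "gB": 3, "gC": 1}])
```

Genes that are consistently highly ranked across comparisons receive the smallest rank products, which is exactly the consistency property the method exploits.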
Combining p-values
In the 1920s, Fisher developed a method for combining the p-values from individual studies [117]. In the k-th study, for the i-th gene, the p-value p_{ik} is generated by one-sided hypothesis testing. The logarithms of the p-values for the same gene are summed across the K studies using Fisher's sum-of-logs method, defined as follows [50, 51]:

S_i = -2 \sum_{k=1}^{K} \log(p_{ik}). (2.2)

The S_i values are then assessed by comparison against a chi-square distribution with 2K degrees of freedom, \chi^2_{2K}.
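Fisher's sum-of-logs method is straightforward to sketch. The closed-form chi-square tail below relies on 2K being an even number of degrees of freedom, a standard identity assumed here; the p-values are illustrative.

```python
import math

def fisher_combine(pvals):
    """Combine per-study p-values with Fisher's sum-of-logs method.

    S = -2 * sum(ln p_k) follows a chi-square distribution with 2K
    degrees of freedom under the null; since 2K is even, the tail
    probability has the closed form exp(-S/2) * sum_{j<K} (S/2)^j / j!.
    """
    K = len(pvals)
    s = -2.0 * sum(math.log(p) for p in pvals)
    half = s / 2.0
    tail = math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(K))
    return s, tail

# Three modestly significant studies combine into stronger evidence
s, p_combined = fisher_combine([0.01, 0.03, 0.20])
```

Note how the combined p-value is smaller than any individual study's, reflecting the accumulation of evidence across studies.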
Rhodes et al. [36] proposed a statistical model for performing meta-analysis of four independent prostate cancer microarray datasets from cDNA arrays and Affymetrix arrays. Each gene in each study was treated as an independent hypothesis, and a significance level (denoted by one p-value and one q-value) was assigned to each gene in each study based on random permutations. The similarity of significance across studies was assessed with meta-analysis methods and combined with multiple-inference statistical tests for each possible combination of studies. A cohort of genes was identified as consistently and significantly deregulated in prostate cancer. Marot et al. [40] used a sequential meta-analysis method in which they computed p-values at each sequential step and combined p-values from different sequential steps to determine the p-values for the results, which are either significant genes or biological processes, throughout the entire analysis process.
Combining effect sizes
The first step is to calculate the effect size d_{ik} and the variance associated with the effect size for gene i in the k-th study, i = 1, \ldots, N; k = 1, \ldots, K:

d_{ik} = \frac{\bar{x}_{ik}^t - \bar{x}_{ik}^c}{S_k}, (2.3)

where S_k is the pooled standard deviation estimated across the N genes in the dataset of the k-th study,

S_k^2 = \sum_{i=1}^{N} (N_{ik} - 1) s_{ik}^2 \Big/ \left( \sum_{i=1}^{N} N_{ik} - N \right). (2.4)

Based on the mean differences, an estimated pooled mean difference, \mu_i, and its variance, \sigma_i^2, are computed for each gene across the K studies. A z-score z_i is then calculated using \mu_i and \sigma_i^2:

z_i = \mu_i / \sqrt{\sigma_i^2}. (2.5)
Effect size can be calculated using the correlation coefficient and the method of Cohen [135], which is the difference between the means of two groups standardized by their pooled standard deviation [37, 48, 51]. Statistical significance can be determined by comparing these z-scores against N(0,1). However, if the number of studies, K, is small, there may be a problem with overfitting; in this case, a permutation method is used instead. Hedges and Olkin [136] showed that this standardized difference overestimates the effect size for studies with small sample sizes, and they proposed a small correction factor, known as Hedges' adjustment, to calculate an unbiased estimate of the effect size. The study-specific effect sizes for every gene are then combined across studies into a weighted average, with weights inversely proportional to the variance of the study-specific estimates. Choi et al. [37-38] introduced a meta-analysis method that combines the results from individual datasets in the form of effect sizes and can model the inter-study variation. The effect size was defined as a standardized mean difference between the test samples and normal samples. The effect sizes from multiple microarray datasets were combined to obtain an estimate of the overall mean, and statistical significance was determined by a permutation test extended to multiple datasets. Data integration using this method was demonstrated to promote the discovery of small but consistent expression changes and to increase the sensitivity and reliability of the analysis. An extended effect-size model was later proposed by Hu et al. [39] for meta-analysis of microarray data.
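The inverse-variance weighted average described above can be sketched as a simple fixed-effects model; this is a simplified version of the Choi et al. approach (no inter-study variance term), and the numbers are illustrative.

```python
import math

def combine_effect_sizes(effects, variances):
    """Fixed-effects combination of study-specific effect sizes.

    Weights are inversely proportional to each study's variance;
    returns the pooled effect mu, its variance, and the z-score mu/sd.
    """
    weights = [1.0 / v for v in variances]
    wsum = sum(weights)
    mu = sum(w * d for w, d in zip(weights, effects)) / wsum
    var = 1.0 / wsum
    return mu, var, mu / math.sqrt(var)

# Three studies reporting similar standardized mean differences
mu, var, z = combine_effect_sizes([0.8, 1.1, 0.9], [0.04, 0.09, 0.05])
```

Studies with smaller variance (larger sample sizes) dominate the pooled estimate, and the resulting z-score can be compared against N(0,1) as described above.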
Issue 5: Sample/feature selection
Regardless of the method of data integration, finding features or samples that together represent the biological meaning of the integrated datasets remains a challenge. Lin [42] used a heuristic method to detect significant genes from integrated data: the underlying spaces were determined using a heuristic search, and a set of candidate gene lists was then identified, again using a heuristic search. Parkhomenko et al. [43] used principal component analysis (PCA) to determine the most important components of the data spaces; pairs of components are processed one by one using a correlation measure to look for the 'best' combination of components in the dataset. This analysis is also heuristic-based. Su et al. [44] used a set of genes similar to the gene expression signature targeted to the biological context of interest to detect candidate genes, called VIP (Very Important Pool) genes. These genes control the search process to find the significant samples (in the integrated dataset, the samples are columns, as mentioned in Issue 1) and genes from the integrated data. Tan et al. [47] approached the problem using a GA and a resampling method; the feature and sample candidates were selected in a controlled process using the GA, and the fitness function was computed from statistical test results on resampled sets.
Traditional methods to identify similar samples on the basis of gene expression values are Principal Component Analysis and Hierarchical Cluster Analysis. These methods, however, tend to strongly relate samples from the same cell lines and batches because of the similarity among cells grown at the same time [25]. In addition, these methods require all expression profiles to be generated on the same microarray platform, because determining distances between expression profiles from different platforms is not straightforward, which limits their use with the available microarray repositories. Lastly, these methods may not benefit from additional information regarding the biological context of interest. For such cases, Huttenhower et al. [49] proposed a meta-analysis method based on a statistical approach within a Bayesian framework. Considering the biological context of the analysis, they measured the distance between every pair of genes and constructed a Bayesian framework for each specific biological context based on the set of arrays. The framework was then used to compute the probability of the relevance of each array with respect to the context; this probabilistic measurement can be used to filter the samples. Their method requires a set of genes targeting the biological context; in addition to a definition by biologists, the biological context can be defined by a set of genes having specific GO (Gene Ontology) terms, participating in a specific biological process, or belonging to a gene expression signature.
Lamb et al. [25] and Zhang et al. [26] used gene-expression-signature-based pattern matching and ranking-based approaches to perform meta-analysis on CMAP datasets. Genes in each expression profile were ranked using fold changes, and expression profiles were then integrated based on how well they fit a given gene expression signature. Their research successfully identified a number of interesting compounds that correlate with known biological processes.
2.3 Cluster analysis
After meta-analysis, we obtain a list of genes differentially expressed in a single study or across multiple studies. Using this gene list, we extract the matrix of expression levels of genes significantly expressed under the experimental conditions. Cluster analysis then groups genes such that genes within the same cluster have similar expression patterns across studies. This analysis allows the simultaneous comparison of all clusters independently of the platform used and the species studied; similar samples under specific biological conditions from different species can also be analyzed together by meta-analysis. This comparison allows identification of: i) robust signatures of a pathology or a treatment across several independent studies; ii) sets of genes that may be similarly modulated in different disease states or following drug treatments; iii) common sets of co-expressed genes between human and animal models. Because a typical microarray study generates expression data for thousands of genes from a relatively small number of samples, an analysis of the integrated data from several sources or laboratories may produce outcomes that are not necessarily related to the biological processes of interest. Cluster analysis further explores the groups of genes that are key agents of the experimental results.
Given a set of n objects X = {x_1, x_2, ..., x_n}, let θ = {U, V}, where the cluster set
V = {v_1, v_2, ..., v_c} and the partition matrix U = {u_ki}, k = 1,...,c, i = 1,...,n, form a
partition of X such that X = ∪_{k=1}^{c} v_k and Σ_{k=1}^{c} u_ki = 1, i = 1,...,n. Each
subset v_k of X is called a cluster and u_ki is the membership degree of x_i in v_k;
u_ki ∈ {0,1} for a crisp partition, otherwise u_ki ∈ [0,1]. The goal of cluster analysis
is to assign objects to clusters such that objects
in the same cluster are highly similar to each other while objects from different clusters
are as divergent as possible. These subgoals create what we call the compactness and
separation factors that are used, not only for modelling the clustering objectives, but
also for evaluating the clustering result. These two parameters can be mathematically
formulated in many different ways that lead to numerous clustering models.
A dataset containing the objects to be clustered is usually represented in one of
two formats, the object data matrix and the object distance matrix. In an object data
matrix, the rows usually represent the objects and the columns represent the attributes of
the objects regarding the context where the objects occur. The roles of the rows and
columns can be interchanged for another representation method, but this one is
preferred because the number of objects is always enormously large in comparison with
the number of attributes. Here objects are genes and attributes are the expression levels
of the genes under the experimental conditions. Assume we have n genes and p
experimental conditions. The gene data matrix X then has n rows and p columns,
where x_ij is the expression level of gene i under condition j, and x_i is the expression
vector (or object vector) of gene i across the p experiments. The distance matrix
contains the pairwise distances (or dissimilarities) of objects. Specifically, entry (i,j)
in the distance matrix represents the distance between objects i and j, 1 ≤ i, j ≤ n.
The distance of genes i
and j can be computed using the object vectors i and j from the object data matrix based
on a distance measurement. However, the object data matrix cannot be fully recovered
from the distance matrix, especially when the value of p is unknown.
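As a concrete illustration of the relationship between the two representations, the following sketch derives the object distance matrix from an object data matrix using Euclidean distance (one common choice; the data values are illustrative):

```python
# Sketch: derive the n-by-n distance matrix from an n-by-p object data matrix
# using Euclidean distance. The data values below are illustrative only.
import math

def distance_matrix(X):
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d  # symmetric: d(i, j) = d(j, i)
    return D

X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # n = 3 objects, p = 2 attributes
D = distance_matrix(X)
```

Note the asymmetry of the two forms: D is fully determined by X, but X cannot be recovered from D.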
2.3.1 Clustering algorithms
Clustering algorithms are classified into either hierarchical or partitioning
approaches. While hierarchical methods group objects into a dendrogram (tree
structure) based on the distances among clusters, partitioning methods directly
group objects into clusters, and objects are assigned to clusters based on the criteria
of the clustering models.
Clustering methods using a partitioning approach are classified into either
model-based or heuristic-based approaches. Model-based clustering assumes that the
data were generated by a model and tries to recover the original model from the data.
The model obtained from the data then defines the clusters and assigns objects to them.
Although K-Means is a heuristic-based approach, once a set of K centers that are good
representatives of the data is found, we can consider it a model that generates the
data with membership degrees in {0,1} plus added noise. In this context, K-Means
can be viewed as a model-based approach.
2.3.2 Hierarchical clustering algorithms
Hierarchical clustering (HC) builds a dendrogram (hierarchical structure) of
objects using either agglomerative (bottom-up) or divisive (top-down) approaches. In
the former approach, each object initially forms its own subtree (singleton cluster).
The clustering process merges similar subtrees: at each step, the two most similar
subtrees are merged to form a new subtree. The process stops when the desired
number of subtrees, say c, at the highest level is achieved. In the top-down approach,
the dendrogram starts with a single subtree that contains all the objects. The process
then splits subtrees into new subtrees: at each step, the subtree of objects with
maximum dissimilarity is split into two subtrees. The process stops when the desired
number of leaves, say c, or the compactness and separation criteria are achieved. The
dissimilarity between every pair of clusters is measured by either the single linkage:
L_S(v_k, v_l) = min{ d(x_i, x_j) : x_i ∈ v_k, x_j ∈ v_l },   (2.6)
the complete linkage:
L_C(v_k, v_l) = max{ d(x_i, x_j) : x_i ∈ v_k, x_j ∈ v_l },   (2.7)
or the average linkage:
L_A(v_k, v_l) = Σ_{x_i ∈ v_k, x_j ∈ v_l} d(x_i, x_j) / (|v_k| · |v_l|).   (2.8)
The clustering process results in a set of c subtrees at the level of interest in
each approach. In comparison with the partitioning approach, each such subtree
corresponds to a cluster.
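The three linkage criteria in Eqs. (2.6)-(2.8) can be sketched directly; here the objects are 1-D points and d(.) is the absolute difference, chosen only for illustration:

```python
# Sketch of the three linkage criteria, Eqs. (2.6)-(2.8), for two clusters of
# 1-D points; d(.) is the absolute difference, chosen for illustration.

def d(x, y):
    return abs(x - y)

def single_linkage(A, B):    # Eq. (2.6): closest pair across the two clusters
    return min(d(a, b) for a in A for b in B)

def complete_linkage(A, B):  # Eq. (2.7): farthest pair across the two clusters
    return max(d(a, b) for a in A for b in B)

def average_linkage(A, B):   # Eq. (2.8): mean of all cross-cluster distances
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

A, B = [0.0, 1.0], [4.0, 5.0]
```

An agglomerative run simply merges, at each step, the pair of subtrees minimizing the chosen linkage.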
Because of their methods of grouping or splitting subtrees, HC algorithms are
considered heuristicbased approaches. The common criticism for standard HC
algorithms is that they lack robustness and, hence, are sensitive to noise and outliers.
Once an object is assigned to a cluster, it will not be considered again, meaning that HC
algorithms are not capable of correcting prior misclassifications and each object can
belong to just one cluster. The computational complexity of most HC algorithms is at
least O(n²), and this high cost limits their application to large-scale datasets. Other
disadvantages of HC include the tendency to form spherical shapes and the reversal
phenomenon, in which the normal hierarchical structure is distorted.
2.3.3 K-Means clustering algorithm
The K-Means algorithm is a partitioning approach using a heuristic-based method.
Its objective function is based on the square-error criterion,

J(U, V | X) = Σ_{k=1}^{c} Σ_{i=1}^{n} u_ki ||x_i − v_k||²,  where u_ki ∈ {0,1}.   (2.9)

K-Means is the best-known square-error-based clustering algorithm. Its
objective, J, is based on both compactness and separation factors. Minimizing J is
equivalent to minimizing the compactness while maximizing the separation.
K-Means algorithm
Steps
1) Initialize the partition matrix randomly or based on some prior knowledge.
2) Compute the cluster set V.
3) Assign each object in the dataset to the nearest cluster.
4) Re-estimate the partition matrix using the current cluster set V.
5) Repeat Steps 2, 3 and 4 until no object changes its cluster.
The K-Means algorithm is very simple and can easily be implemented to solve
many practical problems. It works very well for compact and hyperspherical clusters.
The computational complexity of K-Means is O(cn); because c is much smaller than n,
K-Means can be used to cluster large datasets. The drawbacks of K-Means are the lack
of an efficient, universal method for identifying the initial partitions and the number of
clusters c. A general strategy is to run the algorithm many times with randomly
initialized partitions, although this does not guarantee convergence to a global
optimum. K-Means is also sensitive to outliers and noise: even if an object is distant
from the cluster center, it can still be forced into a cluster, thus distorting the cluster
shapes.
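The steps above can be sketched as follows; for reproducibility, this version takes fixed initial centers instead of a random initial partition, and the data are illustrative:

```python
# Minimal K-Means sketch following the steps above. For reproducibility, the
# random initialization is replaced by fixed initial centers.

def kmeans(X, centers, max_iter=100):
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest cluster center.
        labels = [min(range(len(centers)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(x, centers[k])))
                  for x in X]
        # Steps 2/4: re-estimate each center as the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, l in zip(X, labels) if l == k]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        # Step 5: stop when the centers (and hence assignments) stabilize.
        if new_centers == centers:
            return labels, centers
        centers = new_centers
    return labels, centers

X = [[0.0], [1.0], [9.0], [10.0]]
labels, centers = kmeans(X, centers=[[0.0], [10.0]])
```

With well-separated initial centers, the run converges in a couple of iterations; poor initial centers can lead to a local optimum, which is the drawback noted above.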
2.3.4 The mixture model with expectation-maximization algorithm
Under the probability context, objects can be assumed to be generated according
to several probability distributions. Objects in different clusters may be generated by
different probability distributions, derived from different types of density functions
(e.g., multivariate Gaussian or t-distribution) or from the same family but with
different parameters. If the distributions are known, finding the clusters of a given
dataset is equivalent to estimating the parameters of the underlying models. Denote
by P(v_k) the prior probability of cluster v_k, with Σ_{k=1}^{c} P(v_k) = 1; the
conditional probability of object x_i given the clustering partition θ = {U, V} is

P(x_i | θ) = Σ_{k=1}^{c} P(x_i | u_k, v_k) × P(v_k).   (2.10)
Given an instance of 6, the posterior probability for assigning a data point to a
cluster can easily be calculated using Bayes' theorem. The mixtures therefore can be
applied using any type of components. The multivariate Gaussian distribution is
commonly used due to its complete theory and analytical tractability.
The likelihood function P(X | θ) = Π_{i=1}^{n} P(x_i | θ) is the probability of
generating X using the model θ. The best model θ*, therefore, should maximize the
log likelihood function L(X | θ) = log P(X | θ) = Σ_{i=1}^{n} log P(x_i | θ):

L(X | θ*) = max_θ {L(X | θ)}.   (2.11)

θ* can be estimated using the expectation-maximization (EM) algorithm. EM
considers X a combination of two parts: X^O, the observed part of X, and X^M, the
missing information of X regarding θ; X^M plays the role of U in crisp clustering.
The complete-data log likelihood is then defined as

L(X | θ) = L(X^O, X^M | θ) = Σ_{i=1}^{n} Σ_{k=1}^{c} x_ik^M log[P(v_k) × P(x_i^O | θ_k)].   (2.12)
By using an initial value θ^0 of θ, EM searches for the optimal value θ* by
generating a series of estimates {θ^0, θ^1, ..., θ^T}, where T represents the point at
which the convergence criterion is reached.
EM algorithm
Steps
1) Randomly generate a value θ^0 of θ; set t = 0.
2) E-step: Compute the expectation of the complete-data log likelihood,
Q(θ, θ^t) = E[L(X | θ) | X^O, θ^t].
3) M-step: Select the new parameter estimate that maximizes Q(.),
θ^(t+1) = argmax_θ {Q(θ, θ^t)}.
4) Set t = t + 1. Repeat Steps 2 and 3 until the convergence criterion holds.
EM assumes that the data follow a specific multivariate distribution model, for
example, a Gaussian distribution. The major disadvantages of the EM
algorithm are the sensitivity to the selection of the initial value of 0, the effect of a
singular covariance matrix, the possibility of converging to a local optimum, and the
slow convergence rate. In addition, because most biological datasets do not follow a
specific distribution model, EM may not be appropriate.
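A minimal sketch of EM for a two-component, one-dimensional Gaussian mixture follows; the data and initial parameters are illustrative, and a small variance floor is added as a numerical safeguard:

```python
# Compact EM sketch for a two-component 1-D Gaussian mixture: the E-step
# computes posterior memberships via Bayes' theorem, the M-step re-estimates
# means, variances and priors. Data and starting values are illustrative.
import math

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(X, mu, var=(1.0, 1.0), pi=(0.5, 0.5), n_iter=50):
    mu, var, pi = list(mu), list(var), list(pi)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        R = []
        for x in X:
            p = [pi[k] * gaussian(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            R.append([pk / s for pk in p])
        # M-step: maximize the expected complete-data log likelihood.
        for k in range(2):
            nk = sum(r[k] for r in R)
            mu[k] = sum(r[k] * x for r, x in zip(R, X)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(R, X)) / nk, 1e-6)  # variance floor
            pi[k] = nk / len(X)
    return mu, var, pi

X = [0.0, 0.2, -0.1, 5.0, 5.2, 4.9]
mu, var, pi = em_gmm(X, mu=(0.5, 4.0))
```

The variance floor illustrates the singular-covariance problem mentioned above: without it, a component can collapse onto a single point.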
2.3.5 Fuzzy clustering algorithm
The clustering techniques we have discussed so far are referred to as hard or
crisp clustering, meaning that each object is assigned to only one cluster. In fuzzy
clustering, this restriction is relaxed, and an object x_i can belong to cluster v_k,
k = 1,...,c, with a certain degree of membership u_ki ∈ [0,1]. This is particularly useful
when the boundaries among clusters are not well separated and are therefore
ambiguous. Moreover, the memberships may help us discover more sophisticated
relationships between a given object and the disclosed clusters. Fuzzy C-Means
(FCM) is one of the most popular fuzzy clustering algorithms. FCM attempts to find a
fuzzy partition for a set of data points while minimizing the objective function (2.13).
By using a fuzzifier factor, m, FCM is flexible in managing the overlapping regions
among clusters:

J(U, V | X) = Σ_{i=1}^{n} Σ_{k=1}^{c} u_ki^m ||x_i − v_k||².   (2.13)
FCM algorithm
Steps
1) Randomly initialize the values of the partition matrix U^0; set t = 0.
2) Estimate the values V^t using
v_k = Σ_{i=1}^{n} u_ki^m x_i / Σ_{i=1}^{n} u_ki^m.   (2.14)
3) Compute the new values U^(t+1) that maximally fit V^t:
u_ki = 1 / Σ_{l=1}^{c} (d²(x_i, v_k) / d²(x_i, v_l))^{1/(m−1)}.   (2.15)
4) Repeat Steps 2 and 3 until either {v_k} or {u_ki} converges.
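The alternating updates (2.14) and (2.15) can be sketched as follows; the dataset, initial partition matrix, and stopping tolerance are illustrative:

```python
# Sketch of the FCM loop: alternate the center update (2.14) and the
# membership update (2.15) with fuzzifier m until memberships stabilize.
# Data, initial partition and tolerance are illustrative.
import math

def fcm(X, U, m=2.0, n_iter=100, tol=1e-6):
    c, n = len(U), len(X)
    V = []
    for _ in range(n_iter):
        # Center update (2.14): v_k = sum_i u_ki^m x_i / sum_i u_ki^m
        V = []
        for k in range(c):
            w = [U[k][i] ** m for i in range(n)]
            V.append([sum(wi * xi[j] for wi, xi in zip(w, X)) / sum(w)
                      for j in range(len(X[0]))])
        # Membership update (2.15): u_ki = 1 / sum_l (d2_ki / d2_li)^(1/(m-1))
        newU = [[0.0] * n for _ in range(c)]
        for i in range(n):
            d2 = [max(math.dist(X[i], V[k]) ** 2, 1e-12) for k in range(c)]
            for k in range(c):
                newU[k][i] = 1.0 / sum((d2[k] / d2[l]) ** (1 / (m - 1))
                                       for l in range(c))
        shift = max(abs(newU[k][i] - U[k][i])
                    for k in range(c) for i in range(n))
        U = newU
        if shift < tol:
            break
    return U, V

X = [[0.0], [1.0], [9.0], [10.0]]
U0 = [[0.9, 0.8, 0.2, 0.1], [0.1, 0.2, 0.8, 0.9]]
U, V = fcm(X, U0)
```

For well-separated data the memberships quickly saturate near 0 or 1; the fuzzifier m controls how soft the boundary regions remain.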
As with the EM and K-Means algorithms, FCM performs poorly in the presence
of noise and outliers, and has difficulty identifying the initial parameters. To
address this issue, FCM has been integrated with heuristic-based search algorithms
such as genetic algorithms (GA), ant colony (AC) optimization, and particle swarm
optimization (PSO). The mountain method or subtractive clustering can also be used
to search for the best initial value of V for the FCM algorithm. Due to limitations of
possibility theory, it is difficult to recover data from a fuzzy partition model. However,
recent research has shown that it is possible to construct a probability model from a
possibilistic one; therefore, by using a possibility-to-probability transformation, it is
possible to recover the data from the fuzzy partition model. On this basis, the FCM
algorithm can be considered a model-based algorithm.
The FCM algorithm has advantages over its crisp/probabilistic counterparts,
especially when there is a significant overlap between clusters [5759]. FCM can
converge rapidly and provides more information about the relationships between genes
and groups of genes in the cluster results. We therefore choose the FCM algorithm for
the cluster analysis of gene expression data. However, the FCM algorithm also has
some drawbacks. The objective function can help to estimate the model parameters
but cannot distinguish the "best" solution from the numerous possible locally optimal
solutions. The lack of an effective cluster validation method for FCM prevents its
application to real-world datasets, where the number of clusters is not known. FCM is
well known for its rapid convergence, but this convergence is not guaranteed to reach
the global optimum. Because of these issues, further analysis of FCM will be carried
out in Chapter 3 to help us develop a novel method.
2.4 Gene regulatory sequence prediction
The cluster analysis process produces groups of genes with similar expression
patterns. Genes in each group may also have similar regulatory sequences.
Understanding global transcriptional regulatory mechanisms is one of the fundamental
goals of the postgenomic era [69]. Conventional computational methods using
microarray data to investigate transcriptional regulation focus mainly on identification
of transcription factor binding sites. However, many molecular processes contribute to
changes in gene expression, including transcription rate, alternative splicing,
nonsense-mediated decay, and mRNA degradation (controlled, for example, by
miRNAs or RNA-binding proteins). Thus, computational approaches are needed to
integrate such
molecular processes.
miRNAs bind to complementary sites within the 3' UTRs of target mRNAs to
induce cleavage and repression of translation [175]. In the past decade, several hundred
miRNAs have been identified in mammalian cells. Accumulating evidence thus far
indicates that miRNAs play critical roles in multiple biological processes, including cell
cycle control, cell growth and differentiation, apoptosis, and embryo development. At
the biochemical level, miRNAs regulate mRNA degradation in a combinatorial manner,
i.e., individual miRNAs regulate degradation of multiple genes, and the regulation of a
single gene may be conducted by multiple miRNAs. This combinatorial regulation is
thought to be similar in scope to transcription factor partner regulation. Therefore, in
addition to transcription factors and other DNA/RNAbinding proteins, comprehensive
investigations into transcriptional mechanisms underlying alterations in global gene
expression patterns should also consider the emerging role of the miRNAs.
Motif finding algorithms are designed to identify transcription factor binding sites
(TFBS) and miRNA complementary binding sites (MBS). These sequences are usually
short, ~ 8bp and 23bp, respectively. In addition, each gene may have zero, one, or more
such binding sites (BS). These BSs or motifs may be overrepresented differently in
DNA and RNA sequences of genes in the same group. Motifs are difficult to recognize
because they are short, often highly degenerate, and may contain gaps. Although motif
finding is not a new problem, challenges remain because no single algorithm both
describes motifs and finds them effectively. Popular methods for motif finding use a
position-specific score matrix (PSSM), which describes motifs statistically, or
consensus strings, which represent a motif by one or more patterns that appear
repeatedly with a limited number of differences. Of the two, the PSSM is preferred
because it is more informative and can easily be evaluated using statistical methods.
In addition, a consensus motif model can be converted into a PSSM.
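A minimal sketch of how a PSSM is built from aligned sites and used to score sequence windows follows; the site strings and pseudocount scheme are illustrative, not taken from any published motif:

```python
# Sketch of a position-specific score matrix (PSSM): each column holds
# log-odds scores for A, C, G, T at one motif position, and a window's score
# is the sum over positions. The aligned sites below are illustrative only.
import math

def make_pssm(sites, background=0.25, pseudo=1.0):
    """Build a log-odds PSSM from aligned binding-site strings."""
    pssm = []
    for j in range(len(sites[0])):
        col = {}
        for base in "ACGT":
            count = sum(1 for s in sites if s[j] == base)
            freq = (count + pseudo) / (len(sites) + 4 * pseudo)  # pseudocounts
            col[base] = math.log2(freq / background)
        pssm.append(col)
    return pssm

def score(pssm, window):
    return sum(col[base] for col, base in zip(pssm, window))

def best_hit(pssm, sequence):
    """Return (position, score) of the highest-scoring window."""
    L = len(pssm)
    hits = [(i, score(pssm, sequence[i:i + L]))
            for i in range(len(sequence) - L + 1)]
    return max(hits, key=lambda h: h[1])

pssm = make_pssm(["ACGT", "ACGA", "ACGT"])
pos, s = best_hit(pssm, "TTTACGTTT")
```

Scanning every window like this is the scoring half of motif finding; the hard part, which the algorithms below address, is learning the matrix itself from unaligned sequences.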
One of the most popular motif-finding algorithms using a PSSM is MEME, proposed
by Bailey and Elkan [70]. The advantage of MEME is that it uses expectation-
maximization (EM) which, as a probability-based algorithm, produces statistically
significant results if it can reach a global optimum. The disadvantage is that, for motif
seeds, it uses existing subsequences from within the sequence set and, as a result, may
fail to discover subtle motifs. Chang et al. [72] and Li et al. [76] overcame this
drawback by using a genetic algorithm (GA) to generate a set of motif seeds
randomly. However, because they used GA with random evolution processes, rapid
convergence to a solution is not assured. Li et al. [77] improved on MEME by using one
instance of a position weight matrix (PWM), a type of PSSM, to represent a motif, and
statistical tests to evaluate the final model. However, because EM may converge to
local optima, use of a single PWM may fail to find a globally optimal solution. The
GA-based methods of Wei and Jensen [83] and Bi [71] use chromosomes to encode motif
positions. The method in the former [83] is appropriate for models with zero to one
occurrence of the motif per sequence (ZOOPS); the latter [71] is appropriate for models
with one occurrence per sequence (OOPS), because one variable can be used to
represent the single motif occurrence in each sequence. However, Li et al. [77] recently
showed that ZOOPS and OOPS are inadequate when not every sequence has the same
motif frequency, and that the twocomponent mixture (TCM) model, which assumes a
sequence may have zero or multiple motif occurrences, should be used. However, TCM
requires a set of variables for every sequence to manage motif positions, and hence, the
size of a chromosome can approach the size of the dataset [145]. Lastly, the above
algorithms are restricted to finding gapless motifs and, therefore, will fail to find many
functionally important, gapped motifs. While some methods, e.g., the pattern-based
methods of Pisanti et al. [81] and Frith [73], allow gapped motifs, they require the
gapped patterns to be welldefined and they generate gap positions randomly or by
using a heuristic method. Alternatively, Liu et al. [78] used neural networks to find
gapped motifs, but their approach required a limited and specific definition of the neural
network structure.
2.5 Datasets
In order to evaluate the previous studies as well as our proposed methods for
microarray analysis, we will use multiple data types with different levels of complexity.
Artificial datasets are used in both clustering and motif finding and are preferred for
testing methods because of the ease of generating and running benchmarks. For cluster
analysis, we generated datasets using a finite mixture model of data distribution [27],
because known cluster structures are ideal for evaluating the capability of an algorithm
for determining the number of clusters and the cluster prototypes. For sequence
analysis, we generated motifs in the set of randomly generated DNA sequences using
the normal distribution model [71] with a known number of motif occurrences and
locations in every sequence. Some popular datasets from the Machine Learning
Repository of the University of California, Irvine (UCI) [84] are also used to show how
effective our methods are for cluster analysis on real datasets. The most complex
datasets, e.g., gene expression data, including CMAP datasets, are the most challenging
of clustering problems. Lastly, for sequence analysis, we used eight DNA transcription
factor binding site datasets; two eukaryotic datasets: ERE and E2F [71], and six
bacterial datasets: CRP, ArcA, ArgR, PurR, TyrR and IHF [80].
2.5.1 Artificial datasets for cluster analysis
Artificial datasets were generated using a finite mixture model of data distribution
[27] with different numbers of data objects, clusters, and dimensions. The clusters were
generated with a slight overlap (overlapping ratio = 0.1). For test purposes, we selected
five datasets, named from ASET1 to ASET5, where the first four datasets are uniform
and the last one is non-uniform. Datasets of different dimensions support testing the
stability of our methods. We expect our methods to properly evaluate clustering
results, successfully detect the correct number of clusters, properly locate the cluster
centers, and correctly assign data points to their actual clusters.
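Dataset generation of this kind can be sketched as follows; the component centers, spread, and uniform priors are illustrative, and no overlap-ratio control is implemented in this simplified version:

```python
# Sketch of generating an artificial clustering dataset from a finite mixture
# model: pick a component by its prior, then draw a Gaussian point around the
# component's center. Centers, spread and uniform priors are illustrative;
# the overlap-ratio control used for ASET1-ASET5 is not implemented here.
import random

def generate_mixture(n, centers, sigma=0.5, seed=42):
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        k = rng.randrange(len(centers))      # uniform component priors
        labels.append(k)
        points.append([rng.gauss(mu, sigma)  # draw each dimension independently
                       for mu in centers[k]])
    return points, labels

centers = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]
points, labels = generate_mixture(300, centers)
```

Because the true labels are kept, a clustering result can be scored directly against the known structure.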
2.5.2 Artificial datasets for sequence analysis
We generated simulated DNA datasets as in Bi [71], with three different
background base compositions: (a) uniform, where A, T, C, G occur with equal
frequency, (b) AT-rich (A+T = 60%), and (c) CG-rich (C+G = 60%). The motif string,
GTCACGCCGATATTG, was merged once or twice into each sequence after a defined
level of string change: (i) 9% change, representing limited divergence (i.e., 91% of
symbols are identical to the original string), (ii) 21% change, or (iii) 30% change,
which is essentially background or random sequence variation.
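The construction can be sketched as follows, here implanting a single mutated motif instance into a uniform-composition background (the helper names are our own):

```python
# Sketch of the simulated-sequence construction described above: generate a
# background DNA sequence, mutate the fixed motif string at a given
# divergence level, and implant it at a random position. Helper names are
# our own, not from Bi [71].
import random

MOTIF = "GTCACGCCGATATTG"

def mutate(motif, divergence, rng):
    out = []
    for ch in motif:
        if rng.random() < divergence:  # replace with a different base
            out.append(rng.choice([b for b in "ACGT" if b != ch]))
        else:
            out.append(ch)
    return "".join(out)

def make_sequence(length, divergence, rng=None, weights=(1, 1, 1, 1)):
    """Background with base composition `weights` (A, C, G, T) plus one
    implanted, mutated copy of MOTIF; returns (sequence, implant_position)."""
    rng = rng or random.Random(0)
    background = rng.choices("ACGT", weights=weights, k=length)
    instance = mutate(MOTIF, divergence, rng)
    pos = rng.randrange(length - len(MOTIF) + 1)
    background[pos:pos + len(MOTIF)] = instance
    return "".join(background), pos

seq, pos = make_sequence(200, divergence=0.09)
```

The AT-rich and CG-rich variants correspond to `weights` of (3, 2, 2, 3) and (2, 3, 3, 2), and divergence levels 0.21 and 0.30 reproduce conditions (ii) and (iii).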
2.5.3 Real clustering datasets
Some popular real datasets from the UCI Machine Learning Repository [84] were
used to test our method.
Iris data
The Iris dataset contains information about the sepal length, sepal width, petal
length, and petal width of three classes of Iris flowers: setosa, versicolor and
virginica. Each class has 50 data objects, for a total of 150 data objects in the dataset.
This dataset does not contain missing values. The expected results should recover the
three classes with their own data objects.
Wine
This dataset contains 178 data objects, each with 13 attributes corresponding to the
measures of alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols,
flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315
of diluted wines and proline. The data objects are known to belong to three different
classes. There are no missing values in the dataset. We expect to detect three clusters in
the dataset and properly assign the data objects to their classes.
Glass
This dataset contains 214 data objects, each having ten attributes of which nine are
the glass characteristics of refractive index, sodium, magnesium, aluminum, silicon,
potassium, calcium, barium, and iron content. The last attribute, a value ranging from 1
to 6, is for glass identification. The dataset with the first nine attributes is used for
clustering and the results should contain six clusters and assign the data objects to their
proper class identification.
Breast Cancer Wisconsin
The Breast Cancer Wisconsin (BCW) dataset contains 699 data objects with nine
attributes regarding the clump thickness, uniformity of cell size, uniformity of cell
shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin,
normal nucleoli, and mitoses. The objects of the dataset come from two different
classes of tumor state: benign and malignant. The number of clusters in this dataset is
unknown; recent research has concluded that 6 is a reasonable number of classes. We
expect to discover 6 clusters in the dataset using our methods.
2.5.4 Gene expression datasets
Yeast and YeastMIPS data
The yeast cell cycle data showed expression levels of approximately 6000 genes
across two cell cycles comprising 17 time points [85]. By visual inspection of the raw
data, Cho et al. [85] identified 420 genes that show significant variation. From the
subset of 420 genes, Yeung et al. [102] selected 384 genes that achieved peak values in
only one phase, and obtained five standard categories by grouping into one category
genes that peaked during the same phase. Among the 384 selected genes, Tavazoie et al.
[118], through a search of the protein sequence database MIPS [86], found 237 genes
that can be grouped into four functional categories: DNA synthesis and replication,
organization of the centrosome, nitrogen and sulphur metabolism, and ribosomal proteins.
The functional annotations show the cluster structure in the dataset. We named these
two subsets Yeast and YeastMIPS, respectively. We expect clustering results to
approximate the five-class partition in the former and the four-class partition in the latter.
RCNS - Rat Central Nervous System data
The RCNS dataset was obtained by reverse transcription-coupled PCR designed to
study the expression levels of 112 genes over nine time points during rat central nervous
system development [87]. Wen et al. [87] classified these genes into four functional
categories based on prior biological knowledge. These four classes serve as the
external criterion for this dataset.
2.5.5 CMAP datasets
CMAP datasets contain expression profiles from multiple drug treatments
performed at different concentrations. Originally there were 164 distinct small
molecules used in 453 different treatment instances resulting in 564 expression profiles.
These datasets have been used to detect associations between genes and drugs. Lamb et
al. [25] and Zhang et al. [26] were successful in recovering and exploiting some
gene-drug associations using gene expression signatures with matching-and-ranking-based
methods. These datasets will be used in our approach to demonstrate that we can
replicate the results of Lamb et al. [25] and Zhang et al. [26]. In addition, more gene
expression signatures will be discovered and the associations between DNA motifs and
drugs will be established using our methods. By discovering novel gene expression
signatures and related pathways for the drugs, we expect to exploit the desired effects as
well as the side-effects of drug treatments. Additionally, by using our motif-finding
methods, we hope to predict some of the associations between drugs and DNA
sequences and the connections between miRNAs and diseases.
2.5.6 Real biological sequence datasets
A. Eukaryotic transcription factor binding sites datasets
E2F dataset
E2F transcription factors play a key role in the regulation of cell cycle-associated
genes. Identifying target genes for these factors is a major challenge. We used the set of
25 sequences from Kel et al. [35] that contain 27 instances of the E2F transcription
factor binding site. The goal is to discover motif occurrences that are located exactly
at, or overlap with, the binding sites at known positions.
ERE dataset
The estrogen receptor (ER) is a ligand-activated enhancer protein that is a member
of the steroid/nuclear receptor superfamily. Two genes encode mammalian ERs,
ERalpha and ERbeta. In response to estradiol (E2), ER binds with high affinity to
specific DNA sequences called estrogen response elements (EREs) and transactivates
gene expression. We used the set of 25 sequences from Klinge [14] each of which
contains an instance of the ERE. The purpose is to detect the motif sequences located
exactly at or overlapping with EREs.
B. Bacterial transcription factor binding sites datasets
CRP dataset
CRP is one of seven "global" transcription factors in E. coli known to regulate
more than 100 transcription units. CRP's activity is triggered in response to glucose
starvation and other stresses by the binding of the second messenger cAMP. CRP
binding sites have proved to be particularly noisy because computational searches for
consensus binding sites have missed many known binding sites. CRP was chosen for its
highly indiscriminate binding site [115].
ArcA dataset
ArcA is a global regulator that changes in relation to the expression of fermentation
genes and represses the aerobic pathways when E. coli enters low oxygen growth
conditions. ArcA was chosen for its different protein domain (CheY-like) and very low
consensus binding site [115].
ArgR dataset
ArgR, complexed with L-arginine, represses the transcription of several genes
involved in the biosynthesis and transport of arginine and histidine, and in its own
synthesis, and activates genes for arginine catabolism. ArgR is also essential for a site-specific
recombination reaction that resolves plasmid ColE1 multimers to monomers and is
necessary for plasmid stability.
PurR dataset
PurR dimers control several genes involved in purine and pyrimidine nucleotide
biosynthesis and its own synthesis. This regulator requires binding of two products of
purine metabolism, hypoxanthine and guanine, to induce the conformational change that
allows PurR to bind to DNA.
TyrR dataset
TyrR, tyrosine repressor, is the dual transcriptional regulator of the TyrR regulon
that involves genes essential for aromatic amino acid biosynthesis and transport. TyrR
can act both as a repressor and as an activator of transcription at σ70-dependent
promoters.
IHF dataset
IHF, Integration Host Factor, is a global regulatory protein that helps maintain DNA
architecture. It binds to and bends DNA at specific sites. IHF plays a role in DNA
supercoiling and DNA duplex destabilization and affects processes such as DNA
replication, recombination, and the expression of many genes. These sequence sets were
chosen because of their highly degenerate binding sites.
C. Protein sequences
Sequences from protein families in PFAM and Prosite databases with known motifs
are used to evaluate the performance of our method. The DUF356, StrepHtriad and
Excalibur (extracellular calcium-binding) protein families have unambiguous motifs.
Those from Nupretrotrp and FlagellinC families have weak signals. Sequences from
these families are also used to evaluate the capability of detecting subtle motifs. In
addition to these families, sequences from Xin Repeat and Planctomycete Cytochrome
C protein families are used because they have more variable motifs. For gapped motif
examples, we used sequences from the zf-C2H2, EGF_2, LIG_CYCLIN_1, LIG_PCNA
and MOD_TYR_ITAM families. To test the ability of de novo motif prediction, we
used sequences from the ZZ, Myb and SWIRM domains. Because the motifs of these
domains are unknown, the results from our method are compared with those of other
motif-finding algorithms, and we expect our methods to agree with the results of
others, as shown in Chapter 4.
3. Fuzzy cluster analysis using FCM
3.1 FCM algorithm
The FCM algorithm, initially developed by Dunn [121] and later generalized by
Bezdek [92] with the introduction of the fuzzifier m, is the best-known method for
fuzzy cluster analysis. FCM allows gradual memberships of data points to clusters,
measured as degrees in [0,1]. This provides the flexibility to express that data points
can belong to more than one cluster. In addition, these membership degrees provide
a much finer picture of the data model; they not only afford an effective way of sharing
a data point among multiple clusters, but also express how ambiguously or definitely
the data point belongs to a specific cluster.
Let X = {x_1, x_2, ..., x_n} be the dataset in a p-dimensional space R^p, where each
x_i = (x_i1, x_i2, ..., x_ip) ∈ R^p is a feature vector or pattern vector, and each x_ij is
the j-th characteristic of the data point x_i. Let c be a positive integer, 2 ≤ c ≤ n, and
let R^(c×n) denote the vector space of all real c × n matrices. Let V = {v_1, v_2, ..., v_c}
∈ R^(c×p), of which v_k ∈ R^p is the center or prototype of the k-th cluster, 1 ≤ k ≤ c.
The concept of membership degrees is substantiated by the definition and
interpretation of fuzzy sets. Fuzzy clustering, therefore, allows for a fine-grained
solution space in the form of fuzzy partitions of the dataset. Each cluster v_k in the
cluster partition is represented by a fuzzy set u_k as follows: complying with fuzzy set
theory, the cluster assignment u_ki is now the membership degree of the data point x_i
to cluster v_k, such that u_ki = u_k(x_i) ∈ [0,1]. Because memberships in clusters are
fuzzy, there is not a single cluster label assigned to each data point. Instead, each data
point x_i is associated with a fuzzy label vector that states its memberships in the c
clusters,

u_.i = (u_1i, u_2i, ..., u_ci)^T.   (3.1)
Definition 3.1 (fuzzy c-partition): Any matrix U = {u_{ki}} ∈ M_{fcn} defines a fuzzy c-partition
(or fuzzy partition), where

M_{fcn} = \{ U \in R^{cn} \mid u_{ki} \in [0,1] \ \forall i,k; \ \sum_{k=1}^{c} u_{ki} = 1 \ \forall i; \ 0 < \sum_{i=1}^{n} u_{ki} < n \ \forall k \}.   (3.2)

Row k of a matrix U ∈ M_{fcn} exhibits the k-th membership function u_k in the fuzzy
partition matrix U. The definition of a fuzzy partition matrix (3.2) restricts the assignment
of data to clusters and the membership degrees that are allowed.
Definition 3.2 (fuzzy c-means functionals): Let J_m: M_{fcn} × R^{cp} → R^+ be

J_m(U,V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^m \, d^2(x_i, v_k) \to \min,   (3.3)

where m, 1 < m < ∞, is the fuzzifier factor, also called the weighting exponent, which
controls the degree of fuzziness of the clustering; J_m(U,V) is the infinite family of
objective functions, and d(.) is any inner product norm on R^p:

d^2(x_i, v_k) = \|x_i - v_k\|_A^2 = (x_i - v_k)^T A (x_i - v_k),   (3.4)

where A is a positive definite matrix. If A = I_{p×p}, then d^2(x,y) = \|x - y\|^2.
Theorem 3.1 (prototypes of FCM): If m and c are fixed parameters, and I_i, \bar{I}_i are sets
defined, for all i, 1 ≤ i ≤ n, as

I_i = \{ k \mid 1 \le k \le c; \ d(x_i, v_k) = 0 \}, \quad \bar{I}_i = \{1, 2, \ldots, c\} \setminus I_i,   (3.5)

then (U,V) ∈ (M_{fcn} × R^{cp}) may be a global minimum for J_m(U,V) with respect to (w.r.t.)
the restriction

\sum_{k=1}^{c} u_{ki} = 1, \quad 1 \le i \le n,   (3.6)

as in (3.2), only if, for 1 ≤ i ≤ n and 1 ≤ k ≤ c,

u_{ki} = \left( \sum_{l=1}^{c} \left( \frac{d^2(x_i, v_k)}{d^2(x_i, v_l)} \right)^{1/(m-1)} \right)^{-1} \ \text{if } I_i = \emptyset; \quad u_{ki} = 0 \ \text{for } k \in \bar{I}_i, \ \text{with } \sum_{k \in I_i} u_{ki} = 1, \ \text{if } I_i \ne \emptyset,   (3.7)

and

v_k = \sum_{i=1}^{n} u_{ki}^m \, x_i \Big/ \sum_{i=1}^{n} u_{ki}^m, \quad 1 \le k \le c.   (3.8)
Proof: We consider minimizing J_m w.r.t. U under the restriction (3.6) using Lagrange
multipliers. Let the Lagrange multipliers be λ_i, i = 1, ..., n, and put

L = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^m \, d^2(x_i, v_k) - \sum_{i=1}^{n} \lambda_i \left( \sum_{k=1}^{c} u_{ki} - 1 \right).

The necessary optimality conditions are

\frac{\partial L}{\partial u_{ki}} = m \, u_{ki}^{m-1} \, d^2(x_i, v_k) - \lambda_i = 0,   (3.9)

\frac{\partial L}{\partial v_k} = -2 \sum_{i=1}^{n} u_{ki}^m (x_i - v_k) = 0.   (3.10)

Assume that k ∉ I_i; then d^2(x_i, v_k) ≠ 0. We derive from Equation (3.9), for each k,

u_{ki} = \left( \frac{\lambda_i}{m \, d^2(x_i, v_k)} \right)^{1/(m-1)}, \quad 1 \le k \le c.   (3.11)

Summing up for k = 1, ..., c and taking (3.6) into account, we have

\left( \frac{\lambda_i}{m} \right)^{1/(m-1)} \sum_{k=1}^{c} \left( \frac{1}{d^2(x_i, v_k)} \right)^{1/(m-1)} = 1.

Hence,

\left( \frac{\lambda_i}{m} \right)^{1/(m-1)} = \left( \sum_{k=1}^{c} \left( \frac{1}{d^2(x_i, v_k)} \right)^{1/(m-1)} \right)^{-1}.

Together with (3.11), we obtain the following for the membership:

u_{ki} = \left( \sum_{l=1}^{c} \left( \frac{d^2(x_i, v_k)}{d^2(x_i, v_l)} \right)^{1/(m-1)} \right)^{-1}.

Similarly, we derive from (3.10), for each k, 1 ≤ k ≤ c,

v_k = \sum_{i=1}^{n} u_{ki}^m \, x_i \Big/ \sum_{i=1}^{n} u_{ki}^m.   Q.E.D.
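The update equations (3.7) and (3.8) translate directly into the usual alternating-optimization loop. The following Python sketch illustrates that loop; it is not the code used in this work. The function name `fcm`, the choice A = I (plain squared Euclidean distance), and the data-point initialization are our own illustrative choices.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0, V0=None):
    """Fuzzy c-means by alternating optimization of J_m (Eq. 3.3):
    memberships via Eq. (3.7), prototypes via Eq. (3.8)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # initialize prototypes with c distinct data points unless V0 is given
    V = X[rng.choice(n, size=c, replace=False)].copy() if V0 is None \
        else np.asarray(V0, dtype=float).copy()
    U = np.full((c, n), 1.0 / c)
    for _ in range(max_iter):
        # squared Euclidean distances d^2(x_i, v_k), shape (c, n)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)              # guard the d = 0 case (k in I_i)
        # membership update, Eq. (3.7)
        U = 1.0 / d2 ** (1.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)
        # prototype update, Eq. (3.8): membership-weighted means
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
        if np.abs(V_new - V).max() < tol:
            V = V_new
            break
        V = V_new
    return U, V
```

Alternating the two updates until the prototypes stabilize implements the iteration sequence whose convergence is analyzed in Section 3.2.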
3.2 FCM convergence
The iteration process of the FCM algorithm can be described using a map

T_m: (U, V) \to (U', V'), \quad \text{where } U' = F(V) \ \text{and} \ V' = G(U).

(U^{(t)}, V^{(t)}), t = 1, 2, ..., is called an iteration sequence of the FCM algorithm if
(U^{(t)}, V^{(t)}) = T_m(U^{(t-1)}, V^{(t-1)}), t ≥ 1, where (U^{(0)}, V^{(0)}) is any element of M_{fcn} × R^{cp}.
Set

\Omega = \{ (U^*, V^*) \in M_{fcn} \times R^{cp} \mid J_m(U^*, V^*) < J_m(U, V^*) \ \forall U \in M_{fcn}, U \ne U^*; \ J_m(U^*, V^*) < J_m(U^*, V) \ \forall V \in R^{cp}, V \ne V^* \}.
Theorem 3.2 (descent function, Bezdek) [133]: J_m is a descent function for {T_m, Ω}.
Proof: Because y → d^2(y) and y → y^m are continuous, and J_m is a sum of products of
such functions, J_m is continuous on M_{fcn} × R^{cp}. Moreover,

J_m(T_m(U,V)) = J_m(F(G(U)), G(U))
  < J_m(U, G(U))   by the first case in the definition of Ω,
  < J_m(U, V)      by the second case in the definition of Ω.

Thus, J_m is a descent function for {T_m, Ω}. Q.E.D.
Theorem 3.3 (solution convex set): Let [conv(X)]^c be the c-fold Cartesian product of the
convex hull of X, and let (U^0, V^0) be the starting point of the sequence of T_m iterations,
U^0 ∈ M_{fcn} and V^0 = G(U^0). Then T_m(U^{(t-1)}, V^{(t-1)}) ∈ M_{fcn} × [conv(X)]^c, t = 1, 2, ...,
and M_{fcn} × [conv(X)]^c is compact in M_{fcn} × R^{cp}.
Proof: Let U^0 ∈ M_{fcn} be chosen. For each k, 1 ≤ k ≤ c, we have

v_k^0 = \sum_{i=1}^{n} (u_{ki}^0)^m x_i \Big/ \sum_{i=1}^{n} (u_{ki}^0)^m.

Let \rho_{ki} = (u_{ki}^0)^m / \sum_{i=1}^{n} (u_{ki}^0)^m; then v_k^0 = \sum_{i=1}^{n} \rho_{ki} x_i with \sum_{i=1}^{n} \rho_{ki} = 1.
Thus v_k^0 ∈ conv(X), and therefore V^0 ∈ [conv(X)]^c.
Subsequently, we can prove that U^{(t)} ∈ M_{fcn} and V^{(t)} ∈ [conv(X)]^c for all t ≥ 1. Q.E.D.
Theorem 3.4 (the convergence of FCM, Bezdek et al.) [134]: Let Θ^0 = (U^0, V^0) be the
starting point of the iterations with T_m. The sequence T_m^t(U^0, V^0), t = 1, 2, ..., either
terminates at an optimal point Θ^* = (U^*, V^*) ∈ Ω, or there is a subsequence converging to
a point in Ω.
Proof: J_m is a continuous descent function (Theorem 3.2), and the iteration
sequences always lie in a compact subset of the domain of J_m (Theorem 3.3). Thus, the iteration
either terminates at an optimal point Θ^* = (U^*, V^*) or contains a subsequence converging
to a point in Ω. Q.E.D.
3.3 Distance measures
Similar to other clustering algorithms, FCM uses the distance or dissimilarity of
data points, each described by a set of attributes and denoted as a multidimensional
vector. The attributes can be quantitative or qualitative, continuous or discrete, which
leads to different measurement mechanisms. Accordingly, a dataset in the form of
an object data matrix is designated as two-mode, because its row and
column indices have different meanings. This is in contrast to the one-mode distance
matrix, which is symmetric, with elements representing the distance or dissimilarity
measure for any pair of data points, because both dimensions share the same meaning.
Given a dataset X, a distance function d(.) on X is defined to satisfy the
following conditions:
Symmetry: d(x,y) = d(y,x),
Positivity: d(x,y) ≥ 0, for all x, y ∈ X.
If, in addition, the following conditions hold, then d is a metric:
Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z), for all x, y, z ∈ X,
Reflexivity: d(x,y) = 0 if and only if x = y.
If the triangle inequality is violated, then d is a semi-metric.
Many different distance measures have been used in cluster analysis.
Euclidean distance
The Euclidean distance is the most widely used distance measure and is defined as

d^2(x,y) = \sum_{i=1}^{p} (x_i - y_i)^2.   (3.12)
This distance is a true metric because it satisfies the triangle inequality. In the
Equation (3.12), the expression data Xi and yi are subtracted directly from each other.
We, therefore, need to ensure that the expression data are properly normalized when
using the Euclidean distance, for example by converting the measured gene expression
levels to logratios.
Pearson correlation distance
The Pearson correlation distance is based on the Pearson correlation
coefficient (PCC), which describes the similarity of two objects as

PCC(x,y) = \frac{1}{p} \sum_{i=1}^{p} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right),

where σ_x and σ_y are the sample standard deviations of x and y, respectively. PCC has a
value from -1 to 1: PCC = 1 when x and y are identical, PCC = 0 when they are
unrelated, and PCC = -1 when they are anticorrelated. The Pearson correlation
distance is then defined as

d_p(x,y) = 1 - PCC(x,y).   (3.13)
Figure 3-1: Expression levels of three genes in five experimental conditions (panels: Euclidean, Correlation)
The value of d_p lies in [0,2], where d_p = 0 implies that x and y are identical, and
d_p = 2 implies that they are very different from each other. While the Euclidean measure
takes the magnitude of the data into account, the Pearson correlation measure
geometrically captures the pattern of the two data points [1]. It is therefore widely used
for gene expression data. Even if the expression levels of two genes are different, if
they peak similarly across the experiments, then they have a high correlation coefficient, or
a small Pearson correlation distance (Figure 3-1). By its definition, the Pearson
correlation distance satisfies the distance conditions, but it is not a metric, since it
violates the triangle inequality.
Absolute Pearson correlation distance
The distance is defined as

d_{ap}(x,y) = 1 - |PCC(x,y)|.   (3.14)

Because the absolute value of PCC falls in the range [0,1], the d_{ap} distance also
falls in [0,1]. d_{ap} is equal to 0 if the expression levels of the two genes have the
same shape, i.e., are either exactly the same or exactly opposite. Because it cannot
distinguish these two cases, d_{ap} (3.14) should be used with care.
Uncentered Pearson correlation distance
The distance is based on the uncentered Pearson correlation coefficient (UPCC),
which is defined as

UPCC(x,y) = \frac{1}{p} \sum_{i=1}^{p} \left( \frac{x_i}{\sigma_x^0} \right) \left( \frac{y_i}{\sigma_y^0} \right),

where \sigma_z^0 = \sqrt{ \sum_{i=1}^{p} z_i^2 / p }. The uncentered Pearson correlation distance is defined as

d_{up}(x,y) = 1 - UPCC(x,y).   (3.15)
The d_{up} distance in Equation (3.15) may be appropriate if there is a zero
reference state. For instance, in the case of gene expression data given in terms of
log-ratios, a log-ratio equal to zero corresponds to the green and red signals being equal, which
means that the experimental manipulation did not affect the expression. Because UPCC
lies in the range [-1,1], the d_{up} distance falls between [0,2].
Absolute uncentered Pearson correlation distance
This distance is defined as in (3.16). Because the absolute value of UPCC falls
in [0,1], the d_{aup} distance also falls in [0,1].

d_{aup}(x,y) = 1 - |UPCC(x,y)|.   (3.16)
Comparison of measures
The distance measure plays an important role in obtaining correct clusters. For
simple, low-dimensional datasets, the Euclidean distance measure is commonly
employed. But as the number of dimensions of the dataset increases, where each dimension
denotes a specific aspect of the dataset, the Euclidean distance measure may no longer
be the best choice.
For the Iris dataset, we ran the FCM algorithm with the number of clusters set to
3, which is the number of classes of Iris flowers: Setosa, Versicolor, and Virginica in
the dataset. By using five different distance measures from (3.12) to (3.16), we obtained
five different cluster partitions. For each partition, we compared the cluster label of
each data point with its class label to compute the correctness ratio for every class. The
results are shown in Table 3.1. While the four distance measures using Pearson's
correlation coefficient provide correct results, the Euclidean distance measure performs
worse, particularly on the Virginica class. In this comparison, it is clear that the absolute
Pearson correlation measure and the Pearson correlation measure results are identical.
Similarly, the result of the uncentered Pearson correlation measure is identical with that
of the absolute uncentered Pearson correlation measure.
Table 3.1: Performance of different distance measures on the Iris dataset

  Distance measure method             Classification correctness (%)      Average
                                      Setosa    Versicolor   Virginica    correctness
  Euclidean                           100.00    94.00        76.00        90.00
  Pearson correlation                 100.00    94.00        94.00        96.00
  Abs. Pearson correlation            100.00    94.00        94.00        96.00
  Uncentered Pearson cor.             100.00    92.00        100.00       97.33
  Absolute uncentered Pearson cor.    100.00    92.00        100.00       97.33
Table 3.2: Performance of different distance measures on the Wine dataset

  Distance measure method             Classification correctness (%)      Average
                                      #1        #2           #3           correctness
  Euclidean                           76.27     70.42        56.25        67.65
  Pearson correlation                 91.53     85.92        100.00       92.48
  Abs. Pearson correlation            72.88     67.61        95.83        78.77
  Uncentered Pearson cor.             91.53     85.92        97.92        91.79
  Absolute uncentered Pearson cor.    91.53     85.92        97.92        91.79
We ran a similar benchmark on the Wine dataset, which contains data objects of
three different classes of wines. Results are shown in Table 3.2. Again, all Pearson
correlation based distance measures performed better than the Euclidean distance
measure. The Pearson correlation measures therefore seem to be the better distance
measures for real, unscaled datasets.
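The five measures (3.12) to (3.16) can be sketched in a few lines of Python. This is an illustrative sketch only; the helper names `pcc` and `upcc` are ours, and the correlations are computed in an equivalent sum-of-products form rather than literally as written above.

```python
import numpy as np

def euclidean2(x, y):                       # squared Euclidean distance, Eq. (3.12)
    return float(((x - y) ** 2).sum())

def pcc(x, y):                              # Pearson correlation coefficient
    xc, yc = x - x.mean(), y - y.mean()     # centered vectors; the 1/p and sigma
    return float((xc * yc).sum() /          # factors cancel in this ratio form
                 np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def upcc(x, y):                             # uncentered PCC (zero reference state)
    return float((x * y).sum() /
                 np.sqrt((x ** 2).sum() * (y ** 2).sum()))

d_p   = lambda x, y: 1 - pcc(x, y)          # Pearson correlation distance, Eq. (3.13)
d_ap  = lambda x, y: 1 - abs(pcc(x, y))     # absolute Pearson distance, Eq. (3.14)
d_up  = lambda x, y: 1 - upcc(x, y)         # uncentered Pearson distance, Eq. (3.15)
d_aup = lambda x, y: 1 - abs(upcc(x, y))    # absolute uncentered distance, Eq. (3.16)
```

For two genes with the same expression pattern but different magnitudes, the Euclidean distance is large while d_p is near zero, which is exactly the behavior exploited above.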
3.4 Partially missing data analysis
In cluster analysis, complete information is preferred throughout the experiment.
Unfortunately, real world datasets frequently have missing values. This can be caused
by errors that lead to incomplete attributes or by random noise. For example, sensor
failures in a control system may cause the system to miss information. For gene
expression data, the missing values can come from the platform level and the meta-analysis
level. With the former, the reasons include insufficient resolution, image corruption,
spotting, scratches or dust on the slide, or hybridization failure. The latter can be caused
by differences in platforms or chip generations, and the lack of an appropriate method
to map probes onto genes across different microarray platforms.
In a dataset with partially missing data, some attribute values of a data point x_i
may not be observed. For example, x_i = (x_{i1}, x_{i2}, ?, ?, x_{i5}, ?) has missing values
for the third, fourth and sixth attributes, and only the first, second and fifth
attributes are observed. This causes a problem with distance measurement between x_i
and other data points in the dataset, and the clustering algorithm therefore cannot perform
properly at x_i. Let

X_W = { x_i ∈ X | x_i is a complete data point },
X_P = { x_{ij} | x_{ij} ≠ ? }, and
X_M = { x_{ij} | x_{ij} = ? }.

Clearly, X_P and X_M together contain more information than X_W [140].
Three approaches [137-139] have been widely used to address the problem of
missing values. (1) The ignorance-based approach is the most trivial way to deal
with datasets when the proportion of incomplete data is small, |X \ X_W| << |X|, but the
elimination brings a loss of information. (2) The model-based approach defines a model
for the partially missing data, X_M, using the available information, X_P, and applies the
model to build a complete dataset. However, the complexity of this method can prevent
its application to large datasets. (3) The imputation-based approach supplies the missing
values X_M by certain means of approximation. Of the three, the imputation-based
approach is preferred. Because it needs an approximation model for the
missing data, it is integrated with the clustering algorithm and works based on the
cluster model, which may be probability-based or possibility-based, estimated by
the clustering algorithm.
For the imputation-based approach using FCM, Hathaway et al. [140] identified
four strategies to solve the problem of missing values.
Whole data strategy (WDS)
This strategy, similar to the ignorance-based approach, eliminates the missing
data. Results therefore present the cluster partition of X_W instead of X. This strategy is
applicable when |X_P| / |X| ≥ 0.75.
Partial distance strategy (PDS)
This is the strategy recommended when X_M is large. The distance between every
data point and prototype is computed using X_P only.
d_P^2(x_i, v_k) = \|x_i - v_k\|_{A I_i}^2 = (x_i - v_k)^T (A I_i) (x_i - v_k),   (3.17)

where I_i is the index function of the data point x_i, defined as in (3.18):

I_{it} = \begin{cases} 0, & x_{it} \in X_M \\ 1, & x_{it} \in X_P \end{cases}, \quad 1 \le i \le n, \ 1 \le t \le p.   (3.18)
Optimal completion strategy (OCS)
In addition to the model parameters, this strategy estimates X_M. The FCM algorithm
is modified accordingly: in the first iteration, X_W is used instead of X, and X_M is then
estimated so as to optimize the J_m function using (3.19). FCM is then run on
the entire dataset X until convergence.

x_{ij} = \sum_{k=1}^{c} u_{ki}^m \, v_{kj} \Big/ \sum_{k=1}^{c} u_{ki}^m, \quad x_{ij} \in X_M.   (3.19)
Nearest prototype strategy (NPS)
The nearest prototype strategy is in fact a version of the k-nearest neighbors
strategy [89] for fuzzy partitions. It can also be considered a simplified version of OCS. In
each iteration, X_M is updated as

x_{ij} = v_{kj}, \quad \text{where } x_{ij} \in X_M \ \text{and} \ d(x_i, v_k) = \min_{1 \le l \le c} d(x_i, v_l).   (3.20)
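The partial distance (3.17)-(3.18) with A = I can be sketched as follows. This is illustrative only: we encode the '?' entries of X_M as NaN, and the optional rescaling by p over the number of observed attributes follows Hathaway et al.'s PDS convention rather than Eq. (3.17) itself.

```python
import numpy as np

def partial_distance2(x, v, scale=True):
    """Squared partial distance between a data point x (np.nan marks the
    missing '?' attributes of X_M) and a prototype v, per Eqs. (3.17)-(3.18)
    with A = I."""
    observed = ~np.isnan(x)              # index function (3.18): 1 on observed attrs
    diff = x[observed] - v[observed]
    d2 = float((diff ** 2).sum())
    if scale:                            # rescale by p / |observed| (Hathaway et al.)
        d2 *= x.size / observed.sum()
    return d2
```

Using this distance inside the FCM loop lets the membership update (3.7) proceed unchanged, since only d^2 values enter that formula.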
Algorithm performance measures
Hathaway et al. [140] proposed two measures for the performance evaluation of
imputation algorithms. The misclassification measure (MISC) is computed from the
assigned cluster label and the actual class label of the data objects, and is defined as

MISC(X,V) = \sum_{i=1}^{n} I_m(C(x_i), V(x_i)),   (3.21)

where C(x_i) and V(x_i) are the actual class label and the assigned cluster label of the data
object x_i, respectively, and I_m is the index function

I_m(x,y) = \begin{cases} 1, & x \ne y \\ 0, & x = y \end{cases}.

The cluster prototype error (PERR) measures the difference between the predicted and
the actual cluster prototypes of the dataset; it is defined as

PERR(V) = \left( \sum_{k=1}^{c} d^2(v_k, v_k^*) \right)^{1/2},   (3.22)

where v_k is the k-th predicted cluster prototype and v_k^* is the k-th actual cluster
prototype, k = 1, ..., c. We applied the four strategies discussed above to the Iris
dataset [140] using different scales of the missing values, X_M. The results are shown
in Table 3.3.
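Both measures (3.21) and (3.22) are straightforward to compute. A minimal sketch follows; the function names `misc` and `perr` are ours, and `misc` assumes the cluster labels have already been matched to the class labels, which Eq. (3.21) takes for granted.

```python
import numpy as np

def misc(true_labels, cluster_labels):
    """Misclassification count, Eq. (3.21); counts label disagreements."""
    t = np.asarray(true_labels)
    c = np.asarray(cluster_labels)
    return int((t != c).sum())

def perr(V_pred, V_true):
    """Cluster prototype error, Eq. (3.22): square root of the summed
    squared distances between predicted and actual cluster prototypes."""
    V_pred = np.asarray(V_pred, dtype=float)
    V_true = np.asarray(V_true, dtype=float)
    return float(np.sqrt(((V_pred - V_true) ** 2).sum()))
```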
The dataset contains 150 × 4 = 600 attribute values. X_M was generated randomly
at various scales, from 0 to 0.75, and each method was run 50 times. In Table 3.3,
the first group of columns shows the average number of iterations to convergence, the
second group shows the misclassification errors, and the third group shows the mean
prototype errors, computed from the distances between the estimated prototypes and
the real prototypes of the dataset, for the four methods.
Based on the prototype errors, the accuracy of prototype estimation by
OCS was always at least as good as that of the other approaches. The WDS performed well
for datasets with low percentages of missing values. The second best method was NPS.
Table 3.3: Average results of 50 trials using incomplete Iris data [140]

  %         Mean number of iterations     Mean number of misclassifications   Mean prototype error
  missing   to termination                (training error)
            WDS    PDS    OCS    NPS     WDS    PDS    OCS    NPS            WDS   PDS   OCS   NPS
  0         16     16     16     16      16.00  16.00  16.00  16.00          0.00  0.00  0.00  0.00
  10        18.18  15.64  22.20  16.00   16.04  16.24  16.28  16.28          0.17  0.07  0.07  0.07
  20        20.74  17.24  22.64  18.12   16.42  16.62  16.48  16.54          0.13  0.12  0.11  0.12
  30        22.68  17.84  29.90  20.42   19.00  17.04  16.72  16.98          0.49  0.15  0.14  0.15
  40        23.62  18.48  29.02  22.40   22.46  18.76  18.82  18.84          0.78  0.18  0.17  0.18
  50        14.80  20.14  33.06  29.08   42.64  21.44  21.14  21.36          2.11  0.25  0.21  0.24
  75        N/A    47.20  68.92  70.78   N/A    33.44  29.12  30.54          N/A   0.74  0.39  0.52
3.5 FCM clustering solution validation
A fundamental and most difficult problem in cluster analysis, cluster
validation, is to determine whether a fuzzy partition is of good quality. Cluster
validation addresses the following questions:
i. If all existing algorithms are applied to the dataset and one obtains a multitude
of different partitions, which assignment is correct?
ii. If the number of clusters is unknown, partitions can be determined for different
numbers of clusters. Which partition (or number of clusters) is correct?
iii. Most algorithms assume a certain structure to determine a partition, without
testing whether this structure really exists in the data. Does the result of a certain
clustering technique justify its application?
The cluster validation problem is the general problem of determining whether the
underlying assumptions (cluster shapes, number of clusters, etc.) of a clustering algorithm
are satisfied for the dataset. There are three different approaches to cluster validation:
a) Using hypothesis testing on the clusters,
b) Applying model selection techniques to the cluster model,
c) Using cluster validity measures.
In the first two methods, statistical tests with assumptions about the probabilistic data
distribution model are used. Hence, cluster validation is performed based on the
assumed distribution. While the first approach, using hypothesis testing, is applied to a
single model, and we consider it a local validation approach (Section 3.5.2), the approach using
model selection techniques, similar to the global validation approach (Section 3.5.1), takes into
account a family of probabilistic models corresponding to a set of partitions from which the
best should be selected.
FCM is a possibility-based approach. It is similar to the EM algorithm with a
mixture model, which is a probability-based approach. Compared with FCM, EM, with
a standard model assumption about the data distribution, has the disadvantage that it
converges slowly. However, EM, because of its assumed statistical model, has the
advantage that the first two approaches can be used to evaluate its cluster results. In
contrast, these approaches are not applicable to FCM. Therefore, most methods for fuzzy
partition validation are based on the third approach, where cluster validity measures are used.
Lacking certain knowledge about the data, it has to be determined for each problem
individually which cluster shape is appropriate, what distinguishes good clusters from
bad ones, and whether there are data lacking any structure. There are two common
approaches to creating cluster validity measures.
3.5.1 Global validity measures
Global validity measures are mappings g: A(D,R) → R that describe the quality of a
complete clustering partition using a single real value. The objective function J_m in
(3.3) is a simple example of such a validity measure: when we determine the cluster
parameters and memberships, we minimize the J_m function. However, the objective
function is a poor choice for creating validity functions; it is obvious that J_m → 0 as
c → n, but c = n is not the optimal number of clusters for all datasets. In the recent
literature, a number of cluster validity measures have been developed that address this
problem for specific situations [92-97, 121-131].
Definition 3.3 (partition coefficient, PC; Bezdek): Let A(R^p, R^cp) be an analysis space,
X ⊂ R^p, and U: X → V ∈ R^cp a fuzzy cluster partition. The partition coefficient PC of
the partition U is defined by

V_{PC} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2.   (3.23)

The inequality 1/c ≤ V_{PC} ≤ 1 holds because of (3.2) and (3.6). V_{PC} = 1 for all hard
partitions. If the fuzzy partition contains no information, each object is assigned to each
cluster with the same degree, and the minimum value 1/c of V_{PC} is obtained. The
maximum value of V_{PC} should be sought when selecting among multiple
fuzzy partitions, i.e., V_{PC}(U_{opt}) = \max_k V_{PC}(U_k). The optimal partition, U_{opt}, is the most
"unambiguous" one regarding the cluster assignment of the objects.
Definition 3.4 (partition entropy, PE; Bezdek): Let A(R^p, R^cp) be an analysis space, X
⊂ R^p, and U: X → V ∈ R^cp a fuzzy partition. The partition entropy PE of the partition U
is defined by

V_{PE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki} \ln(u_{ki}).   (3.24)

Similar to the partition coefficient, we can show that 0 ≤ V_{PE} ≤ ln(c) holds for
any fuzzy partition U: X → V. When the fuzzy memberships are uniformly distributed,
V_{PE} tends towards ln(c); if U is the most ambiguous fuzzy partition, then the
maximum value of V_{PE} is obtained. The entropy, i.e., the mean information content of a
source which describes a correct partition, is 0, which is also obtained when U is a hard
partition. To find the best fuzzy partition, we therefore look for the partition with
minimum entropy, i.e., V_{PE}(U_{opt}) = \min_k V_{PE}(U_k).
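Both indices are simple functions of the membership matrix. A short illustrative sketch follows, where U is the c × n membership matrix of (3.2); the small `eps` is our own guard against log(0) and is not part of the definitions.

```python
import numpy as np

def partition_coefficient(U):
    """V_PC, Eq. (3.23): mean squared membership; 1/c (uniform) to 1 (hard)."""
    return float((U ** 2).sum() / U.shape[1])

def partition_entropy(U, eps=1e-12):
    """V_PE, Eq. (3.24): mean membership entropy; 0 (hard) to ln(c) (uniform)."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[1])
```

On a hard partition the sketch returns V_PC = 1 and V_PE near 0; on the maximally ambiguous partition it returns 1/c and ln(c), matching the bounds stated above.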
Both PC and PE suffer from overfitting when estimating the number of clusters,
c, because they use only the memberships, without direct reference to the data
structure. This is why the PC and PE indices tend to degrade for large values of c and
provide optimal decisions only on unambiguous partitions. For a partition to be good,
its clusters should satisfy the compactness and separateness conditions: a partition
contains compact and separated clusters if any two objects from the same cluster have
a smaller distance than any two objects from different clusters.
Fukuyama and Sugeno [93] first proposed a simple approach to measure
cluster separation using the geometry of the cluster centers: the more the clusters scatter
away from the central point of the data, the more the corresponding solution is preferred.
Definition 3.5 (Fukuyama-Sugeno cluster index, FS): Let (R^p, R^cp) be an analysis
space, X ⊂ R^p, and U: X → V ∈ R^cp a fuzzy partition. The Fukuyama-Sugeno cluster
index of the partition U is defined by

V_{FS} = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ki}^m \left( d^2(x_i, v_k) - d^2(v_k, \bar{v}) \right),   (3.25)

where \bar{v} = \sum_{k=1}^{c} v_k / c. For each cluster v_k, 1 ≤ k ≤ c, if the inequality
d^2(x_i, v_k) ≤ d^2(v_k, \bar{v}) holds for every data object x_i, 1 ≤ i ≤ n, then V_{FS} ≤ 0.
V_{FS} = 0 when there is only one cluster in the fuzzy partition. The farther the clusters
lie from the central prototype \bar{v}, the more separated the clusters are. To find the best
fuzzy partition, we therefore look for the partition with minimum V_{FS}.
There are still the cases, however, where some clusters are far away from v but
still close to each other. Dunn [123] proposed a cluster index which measures both the
compactness and the separation of the partition. This cluster index, therefore, was
named the compact and separated cluster index. However, Dunn's index is only
applicable to hard partitions. To solve this problem, Xie and Beni [94] proposed their
cluster index using a separation factor where the distance between clusters is also taken
into account.
Definition 3.6 (Xie-Beni cluster index, XB): Let (R^p, R^cp) be an analysis space, X ⊂
R^p, and U: X → V ∈ R^cp a fuzzy partition. The Xie-Beni cluster index of the partition U
is defined by

V_{XB} = \frac{ \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2 \, d^2(x_i, v_k) }{ n \times Sep(V)^2 } = \frac{ \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2 \, d^2(x_i, v_k) }{ n \times D_{min}(V)^2 },   (3.26)

where Sep(V) = D_{min}(V) = \min_{1 \le k, l \le c, \, k \ne l} d(v_k, v_l) is the minimum distance between the cluster
centers of the partition. Since

\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2 \, d^2(x_i, v_k) \le \max_{1 \le k \le c} \{ diam(v_k)^2 \},

we have V_{XB} \le 1 / DI^2, where DI is the Dunn compact and separated index,

DI = \frac{ D_{min}(V) }{ \max_{1 \le k \le c} \{ diam(v_k) \} }.
Compared to DI, the V_{XB} index is much simpler to compute, and it takes the
membership degrees into account, so that a detour via a hard partition is not necessary
[121]. However, the cluster indices PC, PE, FS and XB do not consider the cluster
shapes and volumes. Gath and Geva, based on the concepts of hypervolume and
density, proposed three cluster hypervolume measures which are the basis of their
cluster validity index and of the indices of many other researchers.
Definition 3.7 (fuzzy hypervolume, FHV): Let (R^p, R^cp) be an analysis space, X ⊂ R^p,
and U: X → V ∈ R^cp a fuzzy partition. The fuzzy hypervolume of the partition U is
defined by

V_{FHV} = \sum_{k=1}^{c} \sqrt{ \det(F_k) },   (3.27)

where F_k = \sum_{i=1}^{n} u_{ki} (x_i - v_k)(x_i - v_k)^T \Big/ \sum_{i=1}^{n} u_{ki} is the fuzzy covariance
matrix of v_k [125]. A fuzzy partition can be expected to have a small value of V_{FHV} if
the partition is compact. However, V_{FHV} measures only the compactness of the
partition. Rezaee et al. [95] extended the V_{FHV} index by adding a term that measures
the scattering between the clusters in the partition.
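With the fuzzy covariance matrices written as outer products, (3.27) can be sketched as below. This is illustrative only: the weights are the plain memberships, as in the definition above, and the name `fuzzy_hypervolume` is ours.

```python
import numpy as np

def fuzzy_hypervolume(X, U, V):
    """V_FHV, Eq. (3.27): sum over clusters of sqrt(det(fuzzy covariance))."""
    total = 0.0
    for k in range(V.shape[0]):
        w = U[k]                                   # memberships of cluster k
        diff = X - V[k]                            # (n, p) deviations from prototype
        F_k = (w[:, None] * diff).T @ diff / w.sum()   # fuzzy covariance matrix
        total += np.sqrt(np.linalg.det(F_k))
    return float(total)
```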
Definition 3.8 (compose within and between scattering index, CWB; Rezaee et al.): Let
(R^p, R^cp) be an analysis space, X ⊂ R^p, and U: X → V ∈ R^cp a fuzzy partition. The
compose within and between scattering index [95] of the partition U is defined by

V_{CWB} = \alpha \, Scat(V) + Dis(V).   (3.28)

Scat(V) measures the average scattering of the c clusters in V and is defined as

Scat(V) = \frac{ V_{FHV} }{ c \, \|\sigma(X)\| }, \quad \text{where } \sigma(X) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \bar{x}\|^2

and \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i is the center point of X. Dis(V) measures the scattering between
the c clusters of V and is defined as

Dis(V) = \frac{ D_{max} }{ D_{min} } \sum_{k=1}^{c} \left( \sum_{l=1}^{c} \| v_k - v_l \| \right)^{-1},

where D_{max} and D_{min} are the maximum and minimum distances between cluster centers.
The first term of V_{CWB}, i.e., Scat() in (3.28), indicates the average scattering
variation within the c clusters of the partition. The smaller this term is, the more compact
the partition is. However, this term does not take into account any geometric
assumptions about the cluster prototypes. The second term of V_{CWB}, Dis(), indicates the
average scattering separation between the clusters. It is influenced by the geometry of the
cluster centers and increases as the number of clusters increases. For the best
partition, we look for the partition with the minimum value of V_{CWB}.
Definition 3.9 (Pakhira-Bandyopadhyay-Maulik fuzzy cluster index, PBMF): Let
(R^p, R^cp) be an analysis space, X ⊂ R^p, and U: X → V ∈ R^cp a fuzzy partition. The PBMF
index [96] of the partition U is defined by

V_{PBMF} = \left( \frac{1}{c} \times \frac{E_1}{J_1} \times D_{max} \right)^2,   (3.29)

where E_1 = n \, \sigma(X) is a constant for each dataset. The term J_1 measures the
compactness, while the term D_{max} measures the scattering separation of the clusters in
the partition. A good partition must have either a small J_1 or a large D_{max}. To find the
best partition, we therefore look for the partition with the maximum value of V_{PBMF}.
Definition 3.10 (Babak Rezaee cluster index, BR): Let (R^p, R^cp) be an analysis space, X
⊂ R^p, and U: X → V ∈ R^cp a fuzzy partition. The BR index [97] of the partition U is
defined by

V_{BR} = \frac{ Sep(V) }{ \max\{ Sep(V) \} } + \frac{ J_2 }{ \max\{ J_2 \} },   (3.30)

where Sep(V) = \frac{1}{c(c-1)} \sum_{k,l=1, \, k \ne l}^{c} C_{REL}(v_k, v_l). Instead of using the geometric
distance between the two centers as the distance of every pair of clusters, Rezaee [97]
used the relative degree of sharing or connectivity, C_{REL}, of the two clusters, defined
as in (3.31).
To compare the performance of the global cluster validity indices (3.23) to
(3.30), we chose the three previously used real datasets, Iris, Wine, and Glass, and the
artificial dataset ASET2, with 200 data points in five well-separated clusters (Figure
3-2).
Figure 3-2: ASET2 dataset with five well-separated clusters
For each of the four selected datasets, we ran FCM with the number of clusters
set from 2 to 10; each partition, corresponding to one cluster number, was evaluated using
the cluster validity measures. The results are shown separately for the individual validity
measures in Figures 3-3 to 3-9.
Figure 3-3: PC index (maximize)   Figure 3-4: PE index (minimize)
(curves: Iris, Wine, Glass, ASET2)
The two indices, PC in (3.23) and PE in (3.24), behave very similarly: when PC
decreases (Figure 3-3), PE (Figure 3-4) increases, and vice versa. These two
measures, however, prefer less ambiguous partitions; they correctly identified the
number of clusters in the artificial dataset, ASET2, but they failed on the real datasets,
where the clusters overlap significantly.
Figure 3-5: FS index (minimize)   Figure 3-6: XB index (minimize)
(curves: Iris, Wine, Glass, ASET2)
Figure 3-7: CWB index (minimize)   Figure 3-8: PBMF index (maximize)
Figure 3-9: BR index (minimize)
(curves: Iris, Wine, Glass, ASET2)
The other validity measures also correctly identified the cluster number in the
ASET2 dataset. However, with the exception of PBMF on the Iris dataset (Figure
3-8), all failed to detect the correct cluster number in the real datasets.
For further comparisons between these validity measures, we used 84 artificial
datasets [147] generated using a finite mixture model, as in Section 2.5.1. The datasets are
distinguished by their dimensions and cluster numbers: with dimensions from 2 to 3 and
cluster numbers from 3 to 9, we generated (3-2+1) × (9-3+1) = 14 dataset types. For each
type, we generated 6 datasets, for a total of 6 × 14 = 84. For each artificial dataset, we ran
the standard FCM five times with m set to 2.0. In each case, the best fuzzy partition was
then selected, and the validity measures were run to search for the optimal number of
clusters between 2 and 12 and to compare it with the known number of clusters. We
repeated the experiment 20 times and averaged the performance of each method. The
results of these comparisons are shown in Figure 3-10.
Figure 3-10: Performance of the global validity measures on artificial datasets with
different numbers of clusters (curves: PC, PE, FS, XB, CWB, PBMF, BR)
The PC measure provides better results than the PE measure. Measures using the geometry of the prototypes and geometric distance measures, i.e., PBMF, CWB and FS, perform much better than validity measures that do not use prototype geometry. However, using geometric information to measure partition compactness does not improve a validity measure's performance; cluster compactness should instead be measured using the memberships. This is reasonable because the data are not always distributed equally, and the cluster volumes, therefore, may differ from cluster to cluster; it is not necessary that two data points having the same
memberships should have the same distance to the centers of clusters to which they
belong. The selection of validity measures therefore should be specific to the individual
application.
3.5.2 Local validity measures
Determination of the optimum cluster number using global validity measures is
very expensive, because clustering must be carried out for multiple cluster numbers. If a
perfect partition cannot be recognized by a single run, we could at least filter the
presumably correctly recognized clusters [121]. A validity measure that evaluates clusters individually, called a local validity measure, is therefore necessary. Such a measure helps to correct the partition more directly. The concept of local validity
measures is that, by running FCM with a possibly maximized cluster number [122] and
analyzing the resulting partition, we provide an improved initialization for the next run.
We continue this procedure until finally the whole partition consists only of good
clusters. If the final partition with only good clusters is found, we have determined the
optimal number of clusters and the best assignment of data points to clusters. Hence, in
addition to addressing the problem of cluster validation, the local validity index helps
with the problem of parameter initialization.
Definition 3.11 (Relative degree of connectivity of two clusters, C_REL): Let (R^p, R^cp) be an analysis space, X ⊂ R^p and U: X→V ∈ R^cp a fuzzy partition. The relative degree of connectivity of the two clusters v_k, v_l ∈ V is defined as [97, 127]

$$C_{REL}(v_k, v_l) = \sum_{i=1}^{n} \min\{u_{ki}, u_{li}\} \times e_i, \qquad (3.31)$$

where $e_i = -\sum_{t=1}^{c} u_{ti} \ln(u_{ti})$ is the entropy of the c-partition at x_i ∈ X and has a maximum value of ln(c) when x_i has the same membership degree in all clusters. Hence, we have $0 \le C_{REL}(v_k, v_l) \le n\ln(c)/2$. The two clusters v_k, v_l are well separated as C_REL → 0.
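The computation in (3.31) can be sketched as follows (an illustrative Python fragment, not the thesis implementation; the c×n membership-matrix layout is an assumption):

```python
import math

def c_rel(U, k, l):
    """Relative degree of connectivity (3.31) between clusters k and l.

    U is a c x n membership matrix: U[t][i] is the membership of point i
    in cluster t.  Each point contributes min(u_ki, u_li) weighted by the
    entropy e_i of its membership vector.
    """
    c, n = len(U), len(U[0])
    total = 0.0
    for i in range(n):
        # entropy of the c-partition at x_i (0 for crisp, ln(c) for uniform)
        e_i = -sum(U[t][i] * math.log(U[t][i]) for t in range(c) if U[t][i] > 0)
        total += min(U[k][i], U[l][i]) * e_i
    return total
```

A crisp partition yields C_REL = 0 (every e_i vanishes), while uniform memberships maximize it, matching the bound discussed above.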
Definition 3.12 (Inter-cluster connectivity, C_IC): Let (R^p, R^cp) be an analysis space, X ⊂ R^p and U: X→V ∈ R^cp a fuzzy partition. The inter-cluster connectivity [129] of the two clusters v_k, v_l ∈ V is defined as

$$C_{IC}(v_k, v_l) = |RF_{kl}| + |RF_{lk}|, \qquad (3.32)$$

where $RF_{kl} = \{x_i \in X \mid u_{ki} = \max_t\{u_{ti}\},\ u_{li} = \max(\{u_{ti}\}_{t=1,\dots,c} \setminus \{u_{ki}\})\}$.

C_IC(v_k, v_l) measures the overlap region between clusters v_k and v_l. If the two clusters are well separated, then C_IC(v_k, v_l) → 0. A greater C_IC indicates a greater degree of similarity between v_k and v_l. If C_IC(v_k, v_l) is larger than a selected threshold, say 0.5, then the two clusters v_k and v_l need to be combined.
Definition 3.13 (Fuzzy inclusion similarity, S_FI): Let (R^p, R^cp) be an analysis space, X ⊂ R^p and U: X→V ∈ R^cp a fuzzy partition. The fuzzy inclusion similarity [130] of the two clusters v_k, v_l ∈ V is defined as

$$S_{FI}(v_k, v_l) = \frac{\sum_{i=1}^{n} \min\{u_{ki}, u_{li}\}}{\min\left\{\sum_{i=1}^{n} u_{ki},\ \sum_{i=1}^{n} u_{li}\right\}}. \qquad (3.33)$$
Definition 3.14 (Link-based similarity, S_LB): Let (R^p, R^cp) be an analysis space, X ⊂ R^p and U: X→V ∈ R^cp a fuzzy partition. The link-based similarity [131] of the two clusters v_k, v_l ∈ V is defined as

$$S_{LB}(v_k, v_l) = \frac{\sum_{t=1}^{c} S_{indir}^{t}(v_k, v_l)}{S_{max}}, \qquad (3.34)$$

where

$$S_{indir}^{t}(v_k, v_l) = \min\{S_{Dir}(v_k, v_t),\ S_{Dir}(v_t, v_l)\},$$

$$S_{max} = \max_{1 \le k, l \le c} \sum_{t=1}^{c} S_{indir}^{t}(v_k, v_l),$$

and

$$S_{Dir}(v_k, v_l) = \frac{|v_k \cap v_l|}{|v_k \cup v_l|}.$$

Because v_k and v_l are fuzzy sets, according to fuzzy set theory we have

$$S_{Dir}(v_k, v_l) = \frac{\sum_{i=1}^{n} \min\{u_{ki}, u_{li}\}}{\sum_{i=1}^{n} \max\{u_{ki}, u_{li}\}}.$$
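For illustration, the fuzzy direct similarity S_Dir and the unnormalized inner sum of (3.34) can be computed as follows (a Python sketch under our own conventions, not the thesis code):

```python
def s_dir(U, k, l):
    """Direct (fuzzy Jaccard) similarity between clusters k and l:
    sum_i min(u_ki, u_li) / sum_i max(u_ki, u_li)."""
    num = sum(min(a, b) for a, b in zip(U[k], U[l]))
    den = sum(max(a, b) for a, b in zip(U[k], U[l]))
    return num / den if den > 0 else 0.0

def s_lb_unnormalized(U, k, l):
    """Indirect similarity of clusters k and l accumulated through every
    intermediate cluster t (the numerator of (3.34), before dividing by
    the maximum over all pairs)."""
    c = len(U)
    return sum(min(s_dir(U, k, t), s_dir(U, t, l)) for t in range(c))
```

Two identical membership rows give S_Dir = 1, and disjoint supports give S_Dir = 0, as the set-theoretic form requires.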
The cluster merging technique evaluates clusters for their compatibility
(similarity) using any of the local validity measures discussed above and merges
clusters that are found to be compatible. In every step of this merging procedure, the most similar pair of adjacent clusters is merged if their compatibility level is above a user-defined threshold γ, and the number of clusters is therefore gradually reduced.
Either the compatible cluster merging (CCM) algorithm [126, 127, 128] or the
agglomerative HC algorithm [132] can be used for this process.
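The merging loop described above can be sketched as follows (an illustrative Python fragment under our own simplifying assumptions: fuzzy clusters are merged by summing their membership rows, and `similarity` stands for any of the local validity measures discussed above):

```python
def merge_compatible(U, similarity, gamma=0.5):
    """Greedy compatible-cluster merging: repeatedly merge the most similar
    pair of clusters while their similarity exceeds the threshold gamma.

    U is a list of membership rows (one per cluster); merging two fuzzy
    clusters sums their membership rows.  `similarity(U, k, l)` may be any
    local validity measure (C_REL, C_IC, S_FI, S_LB).
    """
    U = [row[:] for row in U]            # work on a copy
    while len(U) > 1:
        # find the most similar pair of clusters
        best, bk, bl = -1.0, -1, -1
        for k in range(len(U)):
            for l in range(k + 1, len(U)):
                s = similarity(U, k, l)
                if s > best:
                    best, bk, bl = s, k, l
        if best <= gamma:                # no compatible pair left
            break
        merged = [a + b for a, b in zip(U[bk], U[bl])]
        U = [row for i, row in enumerate(U) if i not in (bk, bl)] + [merged]
    return U
```

With a high threshold nothing merges; with a moderate one, near-duplicate prototypes collapse into a single cluster, which is the behavior the dendrogram experiments below rely on.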
In the case of the Iris dataset, we first ran FCM with the number of clusters set to 12, which is equal to ⌊√n⌋, where n=150, and is clearly greater than the optimal number of clusters. The HC algorithm was applied with the weighted linkage used as a local validity measure to merge adjacent compatible clusters. In Figure 3-11, it can be seen that most of the clusters were merged into two large groups, while the third group contains only one cluster.
Figure 3-11: Dendrogram of the Iris dataset from a 12-partition generated by FCM
For the Wine dataset, we ran FCM with the number of clusters set to 13, which is equal to ⌊√n⌋, where n=178, and is larger than the optimal number of clusters. The HC algorithm was applied with the weighted linkage. It can be seen in Figure 3-12 that the clusters gather into three groups.
Figure 3-12: Dendrogram of the Wine dataset from a 13-partition generated by FCM
By considering only the highest levels of the dendrograms in Figure 3-11 and Figure 3-12, the number of clusters in the Iris and Wine datasets can be identified correctly. However, in fuzzy cluster analysis, several prototypes can share one cluster. A local validity measure based on memberships can identify only a prototype pair instead of all of these prototypes. For example, in the case of the Iris dataset, clusters 2, 3, 7, 8 and 12 share the class Virginica, but only cluster 12 was separated to form a final cluster. For the Wine dataset, clusters 1, 2, 3, 4, 6, 9 and 10 share the overlap region between classes 2 and 3, but only clusters 9 and 10 are grouped to form a separate cluster. To solve this problem, the best cluster is removed after each merging step and the distances among the rest are recomputed [126-132]. This
procedure is repeated until no further good clusters are found or the compatibility level is under the threshold value γ [128].
3.6 FCM partition matrix initialization
Theorem 3.4 shows that the FCM algorithm is very sensitive to the initial value, U⁰, of the partition matrix. The determination of U⁰, i.e., the initialization of the partition matrix, can lead the algorithm to converge quickly to the global optimum or to get stuck in local optima. A common approach in most partitioning clustering methods is to
generate the partition matrix randomly. However, this requires different mechanisms for
different datasets because of differing data distributions.
Figure 3-13: ASET1 dataset with 6 clusters
We used the artificial dataset 1 (ASET1) as an example. This dataset contains 171 data points in six clusters (Figure 3-13), generated using a finite mixture model as described in Section 2.5.1. We ran three standard clustering algorithms, k-means, k-medians and FCM, with the number of clusters set to 6 and the partition matrix generated using a non-uniform random method. The cluster partition found by each algorithm was compared with the classification structure of the dataset. We repeated the experiment 20 times and averaged the performance of each algorithm, as shown in Table 3.4. Because of the random value of U⁰, these algorithms could not always converge to the global optimum. Hence, the correctness ratios for every class and for the whole dataset are sometimes off.
Table 3.4: Performance of three standard algorithms on ASET1

             Correctness ratio by class                Average
Algorithm     1     2     3     4     5     6          ratio
k-means      0.97  0.87  1.00  1.00  1.00  0.75        0.93
k-medians    0.95  0.82  1.00  1.00  1.00  0.62        0.90
FCM          0.97  1.00  0.95  1.00  1.00  0.96        0.98
An alternative approach is to search for the best locations of the cluster candidates using data points in the dataset. An integration of FCM with optimization algorithms such as GA and PSO is preferred. To manage a solution set, in which each solution is a set of data points selected as cluster candidates, the optimization algorithms try different values of the partition matrix. In this way, they solve the problem similarly to the random-initialization method mentioned previously, which may be appropriate for some datasets but not for others.
A simple and yet effective method for the problem of initializing the partition
matrix is subtractive clustering (SC). The subtractive clustering algorithm, originally developed by Chiu [63], is a notable improvement to FCM, in which an improved mountain method
was proposed to determine the number of clusters and estimate their initial values.
Instead of generating the cluster candidates randomly, the SC method searches for the
most dense objects in the dataset. Li et al. [64], Liu et al. [65], and Yang et al. [67] used
SC to locate the most dense data points as cluster candidates. The partition matrix is
then computed using the selected data points as cluster centers. However, the SC
method requires the mountain peak and mountain radii numbers to be specified a priori,
which may differ among datasets. Yang et al. [66] addressed the problem by a method
that automatically computes these values from the dataset. However, the data
distribution may not be identical among clusters, and the estimated parameters may be
appropriate to some clusters but not others. Cheng et al. [68] proposed a weighted SC
method that makes the cluster candidate selection more flexible by allowing the cluster
centers to be located at the means of the dense data points instead of selecting the most
dense data points themselves. This approach can help FCM work effectively. However,
it still requires that all the SC parameters are specified a priori.
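A minimal sketch of the SC idea follows (illustrative Python, not Chiu's exact formulation; the radii `r_a`, `r_b` and the stopping ratio `eps` stand for the a-priori parameters the text refers to):

```python
import math

def subtractive_clustering(X, r_a=1.0, r_b=1.5, eps=0.15):
    """Pick the densest data points as cluster-center candidates.

    Each point gets a 'potential' that sums Gaussian kernels to all other
    points; the highest-potential point becomes a center, and potentials
    near it are then suppressed before the next pick."""
    alpha, beta = 4.0 / r_a**2, 4.0 / r_b**2
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    # initial potential of every point
    P = [sum(math.exp(-alpha * d2(x, y)) for y in X) for x in X]
    centers, p_first = [], max(P)
    while True:
        i = max(range(len(X)), key=lambda j: P[j])
        if P[i] < eps * p_first:          # remaining density too low: stop
            break
        centers.append(X[i])
        # revise potentials: suppress the neighborhood of the new center
        P = [p - P[i] * math.exp(-beta * d2(x, X[i])) for x, p in zip(X, P)]
    return centers
```

The selected centers can then seed the FCM partition matrix; as the text notes, the radii and stopping parameters generally must be tuned per dataset.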
3.7 Determination of the number of clusters
FCM has a disadvantage that the number of clusters must be known in advance.
When this is not known, the results with different numbers of clusters have to be
compared using a validity function in order to find an optimal fuzzy partition with
respect to the new objective function. There are two common methods to address the problem of the number of clusters using this approach [121].
3.7.1 Between-cluster-partition comparison
This method defines a global validity measure which evaluates a complete cluster
partition. An upper bound c max of the number of clusters is estimated, and a cluster
analysis using FCM is carried out for each number of clusters, c, in {2,...,c max }. For
each partition solution, the validity measure generates a validity index which is then
compared with the indices of other partition solutions to find the optimal number of
clusters.
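As an illustration, the between-cluster-partition search can be sketched in Python, assuming the partitions for each candidate c have already been produced by separate FCM runs (the V_PC index used below is one choice of global validity measure):

```python
def partition_coefficient(U):
    """Partition coefficient V_PC = (1/n) * sum_i sum_k u_ki^2.
    Ranges from 1/c (uniform memberships) to 1 (crisp); maximize."""
    n = len(U[0])
    return sum(u * u for row in U for u in row) / n

def best_cluster_number(partitions):
    """Between-cluster-partition comparison: given a dict {c: U_c} of
    partitions from separate FCM runs, return the c whose partition
    maximizes V_PC."""
    return max(partitions, key=lambda c: partition_coefficient(partitions[c]))
```

A crisp 2-cluster partition scores V_PC = 1 and beats a nearly uniform 3-cluster one, so the search returns c = 2 in that case.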
Figure 3-14: PC index on the ASET2 dataset (maximize)
For the case of the ASET2 dataset, we used V_PC, for example, as the global validity measure, and set c_max to 14, which is ⌊√n⌋, where n=200 is the size of ASET2. The FCM algorithm was run with the number of clusters set from 2 to 14; each partition, corresponding to a certain number of clusters, was evaluated using V_PC. The results are
shown in Figure 3-14. To find the optimal number of clusters, we look for the partition with the maximum value of V_PC (Definition 3.3). Five is therefore the optimal number of clusters, which is also equal to the number of classes in ASET2. Hence, in the case of ASET2, V_PC correctly detected the number of clusters.
3.7.2 Within-cluster-partition comparison
This method defines a local validity measure that evaluates individual clusters of
a partition. An upper bound c max of the number of clusters is also estimated and a cluster
analysis is carried out for c=c max . The resulting clusters are then compared using the
validity measure. Similar clusters are merged into one and very bad clusters are
eliminated. These operations reduce the number of clusters in the partition. Afterwards,
a cluster analysis is carried out on the remaining clusters. This procedure is repeated
until the analysis result no longer contains either similar clusters or very bad clusters.
Both 3.7.1 and 3.7.2 are based on the validity measures that we discussed in 3.5.
Hence, the problems with using cluster validity measures may cause these methods to
remain in local optima when searching for the "best" number of clusters; they assume
that the cluster analysis results are optimal for the given numbers of clusters. However,
FCM does not guarantee this [132]. Hence, an optimal cluster partition found by these
methods is possibly only a local optimum. In general [121], to evaluate a possibilistic
cluster analysis, we must consider both the quality of the clusters and the proportion of
the classified data. This is especially difficult with quality measures for individual
clusters according to the method in 3.7.2 [121]. In this case, the method in 3.7.1 is
therefore recommended.
3.8 Determination of the fuzzifier factor
The performance of FCM is affected by an exponent parameter, called the
fuzzifier factor, m. This parameter affects both the convergence rate and the cluster
validity of the algorithm.
When m=1, FCM behaves like k-means; that is, all clusters are disjoint. When m→∞, we have

$$\lim_{m\to\infty} u_{ki} = \lim_{m\to\infty} \left[ \sum_{l=1}^{c} \left( \frac{d^2(x_i, v_k)}{d^2(x_i, v_l)} \right)^{\frac{1}{m-1}} \right]^{-1} = \frac{1}{c} \quad \text{for } 1 \le k \le c \text{ and } 1 \le i \le n,$$

$$\lim_{m\to\infty} v_k = \lim_{m\to\infty} \left[ \sum_{i=1}^{n} u_{ki}^m x_i \Big/ \sum_{i=1}^{n} u_{ki}^m \right] = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} \quad \text{for } 1 \le k \le c,$$
where x̄ is the grand mean of X. Thus, when m goes to infinity, all the data points are considered to belong to only one cluster. Wu [144] ran FCM on the Iris dataset with different values of m (m=1.1, 1.5, 2, 3, 4, ..., 50) and the number of clusters set to 3, and measured the misclassification rate in each cluster partition, as shown in Figure 3-15.
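The limit behavior above can be checked numerically with a small sketch (illustrative Python for one-dimensional data; the standard FCM membership formula is assumed):

```python
def fcm_memberships(x, centers, m):
    """FCM membership of a scalar point x in each cluster for fuzzifier m:
    u_k = [ sum_l (d2(x,v_k)/d2(x,v_l))^(1/(m-1)) ]^(-1)."""
    d2 = [max((x - v) ** 2, 1e-12) for v in centers]  # guard zero distance
    e = 1.0 / (m - 1.0)
    return [1.0 / sum((d2[k] / d2[l]) ** e for l in range(len(centers)))
            for k in range(len(centers))]
```

For a point near one of two centers, m close to 1 gives an almost crisp membership, while a large m (e.g., 50) flattens both memberships toward 1/c, illustrating the limit derived above.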
As m becomes large, the FCM algorithm becomes stable and no improvement can be achieved, even with additional iterations. If U ∈ M_fcn, the constraint in (3.6) makes it difficult to interpret u_ki as representative of the k-th cluster at x_i. This is a very important problem with FCM, and any application of FCM should take the determination of m into account.
Figure 3-15 (Wu [144]): Impact of m on the misclassification rate in the Iris dataset
To date, there are no reliable criteria for the selection of the optimal fuzzifier for a
given set of training vectors. Instead, the best value, with respect to a given value range
of m, is fixed through experimentation; i.e., this choice is still heuristic. To our
knowledge, only a few methods exist to determine an optimal value of m. Romdhane et
al. [57] proposed a method to estimate the value range that should contain the optimal
values of m, by defining the limits of the overlap ratio between clusters. An iteration
process is then carried out until the optimal value is reached. This approach has
provided an easy yet powerful way to estimate the value of m. However, it has the
drawback that its optimal solution is strongly dependent on the initial random value of
m. Dembele et al. [60] developed a method to estimate the value of m using the number of clusters and the estimated memberships. The key idea of this method is that as m approaches its upper bound, the memberships of all data points converge
to 1/c, where c is the number of clusters. They estimate the upper bound value of m
using this idea. This method does not have the problem of local optima caused by the
use of an initially random value of m. However, it requires the number of clusters to be specified a priori. To address this issue, Schwammle et al. [61] proposed a method
using cluster validity indices to estimate the value of c before determining the optimal
value of m. Similarly, Zhao et al. [62] proposed a model-based approach for the estimation of m using a predefined value of c, which varies from 2 to C, where C is the
possibly maximum number of clusters. The estimation methods of these two approaches
are based on the fuzzy partition generated using the predefined value of c. This value is
not guaranteed to be optimal, because most of the cluster validity measures are not
appropriate to the real datasets [147]. The fuzzy partition generated by the FCM
algorithm, using the selected number of clusters, is also not optimal because FCM may
remain in local optima.
3.9 Defuzzification of fuzzy partition
Defuzzification in fuzzy cluster analysis is a procedure to convert the fuzzy
partition matrix U into a crisp partition which is then used to determine the
classification information of the data objects. The most popular approach for
defuzzification of fuzzy partition is application of the maximum membership degree
(MMD) principle [148, 152, 153]. The data object x_i is assigned to the class of v_k if and only if its membership degree in cluster v_k is the largest, that is,

$$u_{ki} = \max_{l=1,\dots,c}\{u_{li}\}. \qquad (3.35)$$
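A sketch of MMD defuzzification (illustrative Python; U is assumed to be a c×n membership matrix):

```python
def defuzzify_mmd(U):
    """Maximum-membership-degree defuzzification (3.35): assign each data
    point to the cluster in which its membership degree is largest."""
    c, n = len(U), len(U[0])
    return [max(range(c), key=lambda k: U[k][i]) for i in range(n)]
```

This is simply a column-wise argmax over the partition matrix; the methods discussed next modify the memberships themselves before this final step.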
This approach may be inappropriate in some applications, because FCM
membership is computed using distance between the data object and cluster center. The
use of membership degree can assign marginal objects of a large cluster to the
immediately adjacent small cluster. This is illustrated in Figure 3-16: if a data object lies in the gray rectangle, it may be incorrectly assigned to cluster 3 instead of cluster 2.
Figure 3-16: Artificial dataset 5 (ASET5) with three clusters of different sizes
Genther, Runkler and Glesner [151] proposed a defuzzification method based on the fuzzy cluster partition (FCB), where the partition information is applied in the membership degree computation. For each data object x_i, i=1,...,n, an estimated value, μ(x_i), is computed using the fuzzy partition. The set of estimated values of X, {μ(x_i)}_{i=1,...,n}, is then used to compute the estimated value w_k of v_k, k=1,...,c, as:

$$w_k = \sum_{i=1}^{n} u_{ki}^m \mu(x_i) \Big/ \sum_{i=1}^{n} u_{ki}^m. \qquad (3.36)$$
The distance between the data object and the cluster center, d_F, in FCB is computed using both the actual and the estimated information as:

$$d_F(x_i, v_k) = \alpha \frac{d^2(x_i, v_k)}{\max_{j,l}\{d^2(x_j, v_l)\}} + \beta \frac{d^2(\mu(x_i), w_k)}{\max_{j,l}\{d^2(\mu(x_j), w_l)\}}, \qquad (3.37)$$

where d_F() is used, instead of d(), to compute the membership degree. The two parameters, α and β, weight the distances from X and its estimate to the cluster prototypes.
Chuang et al. [150], in the Neighborhood Based Method (NBM), on the other hand, proposed adjusting the membership status of every data point using the membership status of its neighbors. Given a cluster v_k, k=1,...,c, a data point x_i, i=1,...,n, receives an additional amount of membership degree from its neighbors:

$$h_{ki} = \sum_{x_j \in NB(x_i)} u_{kj}. \qquad (3.38)$$

This additional information is used in computing the membership degree, as in (3.39), so that the membership level of each data point is strongly associated with those of its neighbors, which helps with the determination of classification information. Here,

$$u_{ki}^{NB} = \frac{u_{ki}^p \times (h_{ki})^q}{\sum_{l=1}^{c} u_{li}^p \times (h_{li})^q}. \qquad (3.39)$$
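Equations (3.38)-(3.39) can be sketched as follows (illustrative Python; the neighborhood lists are assumed to be given, since NBM leaves their definition to the user):

```python
def nbm_memberships(U, neighbors, p=1, q=1):
    """Neighborhood-based membership revision (3.38)-(3.39).

    Each point's membership in cluster k is reinforced by the total
    membership h_ki of its neighbors in k, then renormalized over clusters.
    `neighbors[i]` lists the neighbor indices of point i.
    """
    c, n = len(U), len(U[0])
    out = [[0.0] * n for _ in range(c)]
    for i in range(n):
        # (3.38): membership mass contributed by the neighbors
        h = [sum(U[k][j] for j in neighbors[i]) for k in range(c)]
        z = sum((U[k][i] ** p) * (h[k] ** q) for k in range(c))
        for k in range(c):
            out[k][i] = (U[k][i] ** p) * (h[k] ** q) / z if z > 0 else U[k][i]
    return out
```

An ambiguous point (memberships 0.5/0.5) surrounded by confident cluster-0 neighbors is pulled toward cluster 0, which is the intended smoothing effect.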
Similar to FCB, the NBM method requires a definition of neighborhood boundaries and the values of p and q, which weight the position and spatial information. Chiu et al. [149] also used the spatial information to compute the distance between data point and cluster center in their sFCM (Spatial FCM) method, as in (3.40),

$$d_S(x_i, v_k) = \frac{1}{NB} \sum_{j=1}^{NB} \left[ (1 - \alpha_j)\, d(x_i, v_k) + \alpha_j\, d(x_j, v_k) \right], \qquad (3.40)$$

where NB is the number of neighbors of every data point and α_j is the weighting factor of x_j, a neighbor of x_i, defined as:

$$\alpha_j = 1 / \left(1 + \exp(\lambda_j / \sigma_i)\right), \qquad (3.41)$$

where λ_j = d(x_j, x_i). sFCM uses d_S(), instead of d(), to compute the membership degree, and updates the cluster centers {v_k}, k=1,...,c, as follows:

$$v_k = \sum_{i=1}^{n} u_{ki}^m \left[ \frac{1}{NB} \sum_{j=1}^{NB} \left( (1 - \alpha_j) x_i + \alpha_j x_j \right) \right] \Big/ \sum_{i=1}^{n} u_{ki}^m. \qquad (3.42)$$
Algorithm performance measures
The overall performance of defuzzification methods can be measured based on the
number of data objects that are misclassified, defined as in (3.21); the cluster label of
each data object is compared with its actual class label. If any do not match, then a
misclassification has occurred. The compactness measure, defined in (3.43), can be used to validate how well the actual classification structure is retained as the missing data are updated,

$$\text{compactness} = \sum_{k=1}^{c} \sum_{i=1}^{N_k} d(x_i, v_k). \qquad (3.43)$$
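Both performance measures are straightforward to compute; a sketch of (3.43) follows (illustrative Python, assuming Euclidean distance and a crisp label per point):

```python
import math

def compactness(X, labels, centers):
    """Compactness (3.43): total distance of every data point to the center
    of the cluster it is assigned to (smaller means tighter clusters)."""
    d = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return sum(d(x, centers[k]) for x, k in zip(X, labels))
```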
We tested the four defuzzification algorithms, MMD, FCB, NBM and sFCM, using the two datasets ASET4 and ASET5 with various values of the fuzzifier factor, m. Each algorithm was run 100 times and the results were averaged. The number of neighbors was set to 8. For the ASET4 dataset, which contains 300 data objects in 11 well-separated clusters in a 5-dimensional data space, the results are shown in Table 3.5.
Table 3.5: Algorithm performance on ASET4 using the MISC measure

Algorithm   m=1.25  1.375    1.5  1.625   1.75  1.875      2
MMD           0.00   0.00   0.00   0.00   0.00   0.00   0.00
FCB          12.24  12.24  12.24  12.24  11.72  11.71  11.97
NBM           5.89   5.56   1.97   1.97   1.94   2.03   1.80
sFCM         12.23   9.60   0.30   0.75   0.84   0.84   0.84
In Table 3.5, only the MMD method correctly detected the classification information across the various values of m. The other methods performed worse, particularly the methods that use spatial information to revise the distance between data object and cluster center, FCB and sFCM. This may be because the assumption of 8 as the maximum number of neighbors for each data point, which is reasonable in spatial data, is not appropriate for a generic data space.
We ran the same benchmark on ASET5, which contains 150 data points in 3 clusters of different sizes (see Figure 3-16). The results are shown in Table 3.6.
Table 3.6: Algorithm performance on ASET5 using the MISC measure

Algorithm   m=1.25  1.375    1.5  1.625   1.75  1.875      2
MMD          16.00  16.00  16.00  16.00  16.00  16.00  16.00
FCB           4.32   4.32   4.59   4.59   4.32   4.32   4.32
NBM          27.19  27.19  27.19  27.19  27.19  27.19  27.19
sFCM         17.00  17.00  17.00  17.00  16.00  16.00  16.00
The FCB algorithm performed better than the other methods, and the methods that use spatial information in computing the distance between data object and cluster center, e.g., FCB and sFCM, provided better results. However, none of these algorithms correctly produced the classification information for this dataset.
The test results show that it is not appropriate to apply spatial information to the defuzzification problem in cluster analysis of generic data, because properly determining the number of neighbors for each data point is complicated, and the choice of the two parameters, α and β, in these methods may not be easy due to differences in data distribution among datasets.
4. Methods
4.1 Microarray meta-analysis
4.1.1 Selection of samples and features
Key challenges in meta-analysis methods arise from the small sample sizes typical of microarray studies and from variations due to differences in experimental protocols. Both affect the results of individual studies and inevitably degrade the final results of a meta-analysis. In addition, the main outcome of these approaches is the ranking of genes by significance; however, it is still unknown which feature, gene expression value, statistical test p-value, likelihood measurement, or other statistical measure, best represents the outcome. This chapter addresses the problem of choosing samples from multiple studies with the goal of reducing the variance and discovering more biological meaning. In addition, because microarray data are specific to the biological context of the experiment, the proposed meta-analysis solution will reflect the appropriate biological context.
4.1.2 Sample and feature selection method
Common approaches for sample and feature selection are PCA, hierarchical
clustering analysis, optimization algorithms and ranking methods. The first three
approaches are limited because their results are strongly affected by the lab in which the experiments were carried out [25, 49]. These approaches are also computationally complex, because they use pairwise sample distance measurements. When the number of probes or features is large, this process becomes time-consuming, particularly with
the use of external metrics. Lastly, these methods require the data to be obtained using
the same microarray platform.
The alternative approach is the ranking method. Because ranking itself is invariant to within-array monotonic data transformations, applying the rank-based TSPG (Top-Scoring Pair of Groups) method, which performs on-chip comparisons within each microarray, requires no data normalization or transformation before integration [169].
Feature ID mappings / Selection
For microarray data obtained using the same generation of the same platform,
we directly merge individual data sets using probe sets common to all experiments.
To merge data from different generations of the same microarray platform, we
first use the Affymetrix probeset ID mapping to annotate all probesets. Probes that are
not annotated will be processed using EntrezGene with UniGene data. Probes having
two or more gene IDs will have their data replicated, one for each unique gene.
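The replication step can be sketched as follows (illustrative Python with hypothetical names `probe_data` and `probe_to_genes`; the actual annotation pipeline uses the Affymetrix, EntrezGene and UniGene resources described above):

```python
def expand_probes(probe_data, probe_to_genes):
    """Map probe-level rows to gene IDs; a probe mapped to two or more
    gene IDs has its expression row replicated, one copy per unique gene,
    so no probe is excluded from further analysis."""
    gene_rows = {}
    for probe, values in probe_data.items():
        for gene in probe_to_genes.get(probe, []):
            # replicate the probe's data under each mapped gene ID
            gene_rows.setdefault(gene, []).append(values)
    return gene_rows
```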
To merge data from different platforms, we first map all probe identifiers to
gene identifiers. Probes having two or more gene IDs will be preprocessed as above.
This method avoids excluding probes from further analysis. In this study, we focus on
integrating microarray data from the CMAP datasets, which were generated using the same platform but different generations. Specifically, we integrate microarray data generated from the Affymetrix HG-U133x and HG-U95x platforms to obtain data sets with large (>500) sample sizes. This integrated data set is used to discover common gene expression signatures in response to drug treatments [170].
Sample selection
We developed the RPPR algorithm, a combination of the rank product (RP) statistical method with a modified version of the PageRank™ (PR) algorithm and the goodness-of-fit Kolmogorov-Smirnov (KS) statistical test. The RP method generates the gene ranks for gene expression profiles. The PR algorithm is then applied to augment the gene ranks using protein-protein interaction and GO Biological Process annotation data. The KS test is used to filter the expression profiles whose gene ranks best describe a given gene expression signature.
Given a sample expression profile X and a gene expression signature S, denote by S_u and S_d the lists of up-regulated and down-regulated genes, respectively. The connectivity score between X and S is computed using the Kolmogorov-Smirnov goodness-of-fit (KS) scores of the two sublists in S against the profile X. The KS score of a tag list s against X is calculated as follows:

Sort X based on the appropriate expression status of genes. Denote by t and n the lengths of s and X, respectively. Construct a vector O of the positions (1,...,n) of each probe set of s in X and sort these components in ascending order, such that O(j) is the position of tag j, where j = 1, 2,..., t. Compute the following:
$$a = \max_{1 \le i \le t} \left[ \frac{i}{t} - \frac{O(i)}{n} \right], \qquad b = \max_{1 \le i \le t} \left[ \frac{O(i)}{n} - \frac{(i-1)}{t} \right], \qquad (4.1)$$

$$KS(s, X) = \begin{cases} a, & a > b \\ -b, & a \le b \end{cases}. \qquad (4.2)$$
For each expression profile associated with a specific drug, X^p, p=1,...,P, where P is the number of profiles, let

$$KS_{up}^{p} = KS(S_u, X^p), \qquad KS_{down}^{p} = KS(S_d, X^p).$$

The KS connectivity score of the profile X^p to the gene expression signature S is defined as

$$s^p = KS_{up}^{p} - KS_{down}^{p}. \qquad (4.3)$$

Let

$$S_{max} = \max_{p=1,\dots,P}(s^p), \qquad S_{min} = \min_{p=1,\dots,P}(s^p).$$

The normalized connectivity score of the profile X^p to the gene expression signature S is defined as:

$$S^p = \begin{cases} 0, & \text{sign}(KS_{up}^{p}) = \text{sign}(KS_{down}^{p}) \\ s^p / S_{max}, & s^p > 0 \\ s^p / |S_{min}|, & s^p < 0 \end{cases}. \qquad (4.4)$$
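The scoring in (4.1)-(4.3) can be sketched as follows (illustrative Python, not the thesis implementation; it assumes the convention of (4.2) that the signed score is a when a > b and -b otherwise):

```python
def ks_score(tags, ranked_profile):
    """KS goodness-of-fit score (4.1)-(4.2) of a tag list against a ranked
    profile.  `ranked_profile` is the ordered list of probe IDs and `tags`
    is a signature sublist drawn from it."""
    n = len(ranked_profile)
    pos = {p: i + 1 for i, p in enumerate(ranked_profile)}
    O = sorted(pos[t] for t in tags)        # positions, ascending order
    t = len(O)
    a = max((i + 1) / t - O[i] / n for i in range(t))
    b = max(O[i] / n - i / t for i in range(t))
    return a if a > b else -b

def connectivity_score(up_tags, down_tags, ranked_profile):
    """Raw connectivity score (4.3): KS(up) - KS(down).  Opposite signs
    of the two KS scores indicate a coherent up/down match."""
    return (ks_score(up_tags, ranked_profile)
            - ks_score(down_tags, ranked_profile))
```

Up-tags concentrated at the top of the profile give a positive KS score, down-tags at the bottom a negative one, so a profile matching the signature gets a large positive raw score before the normalization of (4.4).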
Because RPPR uses only the preprocessed expression data, it is also appropriate for datasets generated using next-generation sequencing (RNA-seq) and other high-throughput technologies.
Method testing
We tested RPPR using the CMAP datasets to detect expression profiles strongly related to the expression signatures, S1 and S2, of two drug classes: the histone deacetylase (HDAC) inhibitors (S1) and estrogens (S2). We compared the performance of RPPR with that of Lamb et al. [25] and Zhang et al. [26].
HDAC inhibitor signature
CMAP arrays were preprocessed using RMA, resulting in 564 expression profiles. The RP [29] method was then applied to every study to detect differentially expressed genes and to order the genes in each profile. The pattern-matching ranking method was used to score all the profiles with respect to the HDAC inhibitor signature [25]. We used a cutoff of 0.95 on the normalized connectivity scores to select profiles strongly related to the signature. Results, compared with those of Lamb et al. [25] and Zhang et al. [26], are shown in Tables 4.1 and 4.2, and Figures 4-1 and 4-2, respectively.
Table 4.1: Predicted HDAC antagonists by Lamb, Zhang and RPPR (1: Detected; 0: Not detected): valproic acid, trichostatin A, resveratrol, HC toxin, vorinostat, wortmannin, 5666823, prochlorperazine, 17-allylaminogeldanamycin
Table 4.2: Predicted HDAC agonists by Lamb, Zhang and RPPR (1: Detected; 0: Not detected): estradiol, genistein, nordihydroguaiaretic acid, thioridazine, LY294002, staurosporine, indometacin, alpha-estradiol, tretinoin, doxycycline, troglitazone, 5149715, 12,13-EODE, 17-allylaminogeldanamycin, butein, wortmannin, tetraethylenepentamine, diclofenac, 4,5-dianilinophthalimide, pirinixic acid, haloperidol
Figure 4-1: Venn diagram for predicted HDAC antagonists (Lamb, Zhang, RPPR)
Figure 4-2: Venn diagram for HDAC inhibitors (Lamb, Zhang, RPPR)
Figure 4-1 illustrates the results of the three approaches using the HDAC inhibitor signature (S1). The results of RPPR are similar to those of Lamb et al. [25] because both used the same scoring method. Vorinostat, trichostatin A and valproic acid, well-known HDAC inhibitors, were all identified by RPPR. HC-toxin was not identified by Lamb et al. [25] or Zhang et al. [26], but was identified by RPPR. Although 17-allylaminogeldanamycin, wortmannin and prochlorperazine were not reported in the top 20 profiles from RPPR, they were ranked with high scores: 0.639, 0.557 and 0.657, respectively. It is interesting that only RPPR detected resveratrol, which has recently
been recognized as an HDAC inhibitor like valproic acid [119].
Estrogen signature
Table 4.3: Predicted estrogen receptor antagonists by Lamb, Zhang and RPPR (1: Detected; 0: Not detected): tamoxifen, fulvestrant, trichostatin A, raloxifene, monastrol, 5224221, oxaprozin, HC-toxin, vorinostat, LY294002, demecolcine, 5248896, Y-27632, sirolimus, 17-allylaminogeldanamycin, monorden, sodium phenylbutyrate, deferoxamine, indomethacin, tretinoin
Figure 4-3: Venn diagram for predicted estrogen receptor antagonists (Lamb, Zhang, RPPR)
Figure 4-4: Venn diagram for predicted estrogen receptor agonists (Lamb, Zhang, RPPR)
For the estrogen signature (S2), Figures 4-3 and 4-4 show that the results of RPPR are similar to those of Lamb. It is interesting that both RPPR and Zhang's method identify estradiol with the highest rank, but only RPPR detected tamoxifen as negatively related to the estrogen signature. In fact, tamoxifen is a compound that works by blocking the estrogen receptors on breast tissue cells and slowing their estrogen-induced growth. Tamoxifen has recently been chosen as one of the best replacements for fulvestrant in breast cancer treatment. The results of RPPR agreed significantly with those of Lamb et al. (2006) and Zhang et al. (2008), and some of RPPR's results that were new relative to theirs were also supported by recent literature.
Table 4.4: Predicted estrogen receptor agonists by Lamb, Zhang and RPPR (1: Detected; 0: Not detected)

Drug: Known target/mechanism
• Valproic acid: Epilepsy treatment, ER stimulator
• Trichostatin A: HDAC inhibitor, ER sensitizer
• HC-toxin: HDAC inhibitor, ER signaling pathway related
• Vorinostat: Skin cancer treatment, advanced-stage breast cancer related
• Wortmannin: Long-term estrogen deprivor (usually used with tamoxifen for a better ER inhibitor)
• 5666823: Similar to 2-deoxy-D-glucose (0.7765), like tamoxifen (0.7143)
• Prochlorperazine: Nausea and vomiting treatment; endocrine: not used with hormone-replacement estrogens
• 17-allylaminogeldanamycin: Hsp90 inhibitor; Hsp90 is required for the proper functioning of ER
Although RPPR used the same goodness-of-fit scoring method as Lamb et al.
[25], it is based on different distribution functions, generated by the different
methods used to detect and rank differentially expressed genes. In addition, instead
of using the same distribution function for both up-regulated and down-regulated
genes, RPPR used the up-regulation distribution of the drug-specific profile for the
up-regulated genes and the down-regulation distribution of the profile for the
down-regulated genes. This addresses a weakness of the KS goodness-of-fit test, which
is less sensitive at the tails of the distribution. With the range of the distribution
function restricted, the results produced by RPPR are very close to those of Lamb et
al. [25] and Zhang et al. [26], although no change was made to the step-function part
of the distribution function. This restriction also speeds up the scoring process,
because the distribution function can be well approximated by managing the step
function.
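As an illustration of this style of scoring, the following sketch computes a KS-style enrichment score for an up-regulated and a down-regulated gene set against a ranked profile. This is a simplified, generic connectivity score, not the exact RPPR implementation; the function names and the combination rule are illustrative assumptions:

```python
import numpy as np

def ks_enrichment(ranks, n_total):
    """KS-style enrichment score of a gene set, given its 1-based
    positions in a ranked expression profile of n_total genes."""
    ranks = np.sort(np.asarray(ranks, dtype=float))
    t = len(ranks)
    j = np.arange(1, t + 1)
    a = np.max(j / t - ranks / n_total)        # largest deviation above
    b = np.max(ranks / n_total - (j - 1) / t)  # largest deviation below
    return a if a > b else -b

def connectivity_score(up_ranks, down_ranks, n_total):
    """Score up-genes and down-genes separately; opposite enrichment of
    the two sets in the profile indicates connectivity."""
    ks_up = ks_enrichment(up_ranks, n_total)
    ks_down = ks_enrichment(down_ranks, n_total)
    return 0.0 if np.sign(ks_up) == np.sign(ks_down) else ks_up - ks_down
```

With up-genes near the top of the profile and down-genes near the bottom, the score is strongly positive; a profile with both sets enriched at the same end scores zero.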
The experimental results in this study also show that, although CMAP datasets
are generated from unreplicated single arrays, the use of meta-analysis with the rank
product and PageRank methods, together with biological knowledge, reduces the variance
of the selected samples. The results therefore agree strongly with the biological
context of interest, e.g., the HDAC inhibitors and the estrogen receptor.
4.1.3 Methods for feature metric
One of the most important issues in meta-analysis is how to define a
similarity metric across samples or studies so that results have strong biological
relevance. Recent research has shown that, because of differences in experimental
designs and laboratory-specific effects, integrated gene expression data from multiple
studies may correlate poorly under the biological context of interest. To overcome
this problem, we propose:
• To use protein-protein interactions (PPI) with the Random Walk algorithm to
determine gene-gene proximities.
• To use GO cellular component (CC) and biological process (BP) annotations
with term-term semantic similarity to determine gene-gene similarities.
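A minimal sketch of the first proposal, computing gene-gene proximities on a PPI network with a random walk with restart; the adjacency matrix, the restart probability of 0.5, and the convergence tolerance are illustrative assumptions, not values taken from the thesis:

```python
import numpy as np

def rwr_proximity(adj, seed, restart=0.5, tol=1e-8, max_iter=1000):
    """Random walk with restart on a PPI adjacency matrix. Returns the
    stationary visiting probabilities, i.e., the proximity of every gene
    to the seed gene(s)."""
    adj = np.asarray(adj, dtype=float)
    col = adj.sum(axis=0)
    col[col == 0] = 1.0
    W = adj / col                       # column-normalized transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[seed] = 1.0 / len(seed)          # restart distribution on the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next
```

On a chain network 0-1-2 seeded at gene 0, the proximity decays with network distance from the seed.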
4.2 Fuzzy cluster analysis methods using Fuzzy C-Means (FCM)
The FCM algorithm has several limitations: initialization of the partition
matrix can trap the algorithm in local optima; partially missing data may cause the
algorithm to work improperly; there is no strong mathematical, statistical, or
biological basis to validate the fuzzy partition; FCM cannot determine the number of
clusters in the dataset; and it lacks an effective method for defuzzification, since
the maximum membership degree (MMD) principle has been shown to be inappropriate for
gene expression data.
To address these problems, we first modify the FCM objective function using the
partition coefficient because of its ability to maintain partition compactness
(Section 3.5.1). For initialization of the partition matrix, we developed a new
Subtractive Clustering (SC) method that uses a fuzzy partition of the data instead of
the data themselves [146]. To address the problem of partially missing data, we
developed two new imputation methods using density- and probability-based approaches
[154, 155]. Regarding the cluster validation problem, we developed a new evaluation
method for fuzzy clustering, particularly for gene expression data [147]. The
evaluation model considers both the internal validity, which is statistical in nature,
and the external validity, which is biological. For the problem of determining the
number of clusters, we developed a new clustering algorithm that integrates our
algorithms with a genetic algorithm (GA) to manage a set of candidate solutions [157].
Using our fuzzy cluster evaluation method, the optimization algorithm can search for
the best solutions during its evolutionary process. To produce classification
information from a fuzzy partition, we developed a novel defuzzification method that
generates a probabilistic model of the possibilistic model described by the fuzzy
partition and applies the model to produce the classification information of the
dataset [156].
4.2.1 Modification of the objective function
The partition coefficient (PC) and partition entropy (PE) are the best measures
of the compactness of a fuzzy partition. Because PC performs better than PE in this
regard (Section 3.5.1), PC was used as the compactness factor in the FCM objective
function of our method:
J_m(U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^m d^2(x_i, v_k) + \mu \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^m \to \min,   (4.5)

where \mu, the learning rate, is defined as in (4.78), Section 4.3.1.
We consider minimizing (4.5) w.r.t. (3.6) using the method of Lagrange multipliers. Let
the Lagrange multipliers be \lambda_i, i = 1, ..., n, and put

L = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^m d^2(x_i, v_k) + \mu \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^m - \sum_{i=1}^{n} \lambda_i \left( \sum_{k=1}^{c} u_{ki} - 1 \right).

For the necessary optimality conditions,

\frac{\partial L}{\partial u_{ki}} = m u_{ki}^{m-1} \left[ d^2(x_i, v_k) + \mu \right] - \lambda_i = 0.   (4.6)

With the same reasoning as in (3.9) and (3.10), we obtain

\frac{\partial L}{\partial v_k} = -2 \sum_{i=1}^{n} u_{ki}^m (x_i - v_k) = 0.   (4.7)
From (4.6),

u_{ki} = \left( \frac{\lambda_i}{m \left( d^2(x_i, v_k) + \mu \right)} \right)^{\frac{1}{m-1}} = \left( \frac{\lambda_i}{m} \right)^{\frac{1}{m-1}} \left( \frac{1}{d^2(x_i, v_k) + \mu} \right)^{\frac{1}{m-1}}.   (4.8)
Summing over k = 1, ..., c and taking (3.6) into account, we have

1 = \sum_{k=1}^{c} u_{ki} = \left( \frac{\lambda_i}{m} \right)^{\frac{1}{m-1}} \sum_{k=1}^{c} \left( \frac{1}{d^2(x_i, v_k) + \mu} \right)^{\frac{1}{m-1}}.

Hence,

\left( \frac{\lambda_i}{m} \right)^{\frac{1}{m-1}} = \left[ \sum_{k=1}^{c} \left( \frac{1}{d^2(x_i, v_k) + \mu} \right)^{\frac{1}{m-1}} \right]^{-1}.
Together with (4.8), we obtain the following for the membership in the FCM version of
our method:

u_{ki} = \left[ \sum_{l=1}^{c} \left( \frac{d^2(x_i, v_k) + \mu}{d^2(x_i, v_l) + \mu} \right)^{\frac{1}{m-1}} \right]^{-1}.   (4.9)
Similarly, we derive from (4.7), for each k, 1 \le k \le c:

v_k = \frac{\sum_{i=1}^{n} u_{ki}^m x_i}{\sum_{i=1}^{n} u_{ki}^m}.   (4.10)

The update model of \Theta = {U, V} in our FCM method uses (4.9) and (4.10).
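A minimal sketch of one iteration of these updates, Eqs. (4.9) and (4.10), assuming Euclidean squared distances for d^2 and a fixed value for the learning rate \mu (both illustrative choices):

```python
import numpy as np

def fcm_step(X, V, m=2.0, mu=0.1):
    """One update of the modified FCM: memberships via Eq. (4.9), with
    squared distances shifted by the learning rate mu, then centers via
    Eq. (4.10). X is n x p, V is c x p."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1) + mu  # n x c
    d2 = np.maximum(d2, 1e-12)
    w = d2 ** (-1.0 / (m - 1.0))
    U = w / w.sum(axis=1, keepdims=True)          # each row sums to 1
    Um = U.T ** m                                 # c x n weights
    V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
    return U, V_new
```

Iterating the step on two well-separated blobs keeps the centers near the blob means and produces memberships that are close to crisp.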
4.2.2 Partition matrix initialization method
FCM has recently been integrated with optimization algorithms such as the
Genetic Algorithm, Particle Swarm Optimization, and Ant Colony Optimization [161-166,
176-177]. Alternatively, a Subtractive Clustering (SC) method has been used with FCM,
where SC is applied first to determine the optimal number of clusters and the location
of cluster centers [105-108], and FCM is then used to determine the optimal fuzzy
partitions [105, 111-113]. While this approach can overcome the problem of parameter
initialization, it still requires a priori specification of the parameters of the SC
method: the mountain peak and the mountain radii. SC uses these two parameters to
compute and amend the value of the mountain function at every data point while it
searches for cluster candidates. Most approaches using the traditional SC method use
constant values for these parameters [105, 109-111]. However, different datasets have
different data distributions, and these values therefore need to be adjusted
accordingly.
Yang and Wu [106] proposed a method to do this automatically. However, because data
densities are not distributed equally within a dataset, the automatic values may be
appropriate for only some data points.
In general, use of the SC method to determine the optimal number of clusters is
based on the ratio between the amended value and the original value of the mountain
function at every data point. The data points at which the ratios are above some
predefined cutoff are selected, and their number is used as the optimal number of
clusters [105, 107, 110]. This approach, however, requires specification of the cutoff
value, which differs among datasets.
Method design: histogram-based density
Given \Theta = {U, V}, a fuzzy partition on X, the accumulated density at v_k,
k = 1, ..., c, is calculated [112] as

Acc(v_k) = \sum_{i=1}^{n} u_{ki}.   (4.11)
By assuming a uniform density of the data observed in each cell, an estimate
f_{hist}(x) of the underlying probability density function f(x) at any point x of v_k
can be computed by

f_{hist}(x) = \frac{Acc(v_k)}{sh},   (4.12)

where h is the bin width and s is the number of bins generated on v_k. The density at
each data point x_i is estimated, using a histogram-based method, as

dens(x_i) = \sum_{k=1}^{c} Acc(v_k) \times u_{ki}.   (4.13)
However, the histogram density estimator in (4.12) has some weaknesses. It
is a discontinuous function, and the choice of the values of h and s strongly affects
the estimated density. The a priori knowledge needed to set those values makes it a
tool whose robustness and reliability are too low for statistical estimation. The
density estimator in (4.13) is therefore more accurate at the centers {v_k} than at
other data points [112]. If we try to find the c densest data points using (4.13), we
will simply obtain the centers of the c clusters of the fuzzy partition. To address
this problem, we use the concept of a strong uniform fuzzy partition.
Definition 4.1 (strong uniform fuzzy partition) [112]: Let m_1 < m_2 < ... < m_c be c
fixed nodes of the universe \Omega (or universal set), \Omega = [a, b], such that
m_1 = a and m_c = b, and c > 2. We say that the set of c fuzzy subsets A_1, A_2, ...,
A_c, identified with their membership functions \mu_{A_1}, \mu_{A_2}, ..., \mu_{A_c}
defined on \Omega, forms a strong uniform fuzzy partition of \Omega if they fulfill
the following conditions, for 1 \le k \le c:
1) \mu_{A_k}(m_k) = 1 (the prototype).
2) If x \notin [m_{k-1}, m_{k+1}], then \mu_{A_k}(x) = 0.
3) \mu_{A_k}(x) is continuous.
4) \mu_{A_k}(x) monotonically increases on [m_{k-1}, m_k] and monotonically decreases
on [m_k, m_{k+1}].
5) \forall x \in \Omega, \exists k such that \mu_{A_k}(x) > 0.
6) \forall x \in \Omega, \sum_{k=1}^{c} \mu_{A_k}(x) = 1 (strength condition).
7) m_{k+1} - m_k = h, 1 \le k < c.
8) \mu_{A_k}(m_k - x) = \mu_{A_k}(m_k + x), 1 \le k \le c.
9) \forall x \in [m_k, m_{k+1}], \mu_{A_k}(x) = \mu_{A_{k-1}}(x - h) (same shape in
the overlap region).
Note that the 6th condition is the strength condition; without it, the fuzzy
partition is no longer strong.
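For intuition, triangular membership functions on equally spaced nodes are a standard example satisfying these conditions, including the strength condition. The following sketch, which assumes all points lie inside [m_1, m_c], builds such a partition:

```python
import numpy as np

def triangular_partition(x, nodes):
    """Membership matrix of points x (assumed inside [nodes[0], nodes[-1]])
    in a strong uniform fuzzy partition built from triangular membership
    functions on equally spaced nodes, as in Definition 4.1."""
    x = np.asarray(x, dtype=float)
    nodes = np.asarray(nodes, dtype=float)
    h = nodes[1] - nodes[0]                           # condition 7: equal spacing
    M = 1.0 - np.abs(x[:, None] - nodes[None, :]) / h
    return np.clip(M, 0.0, None)                      # zero outside [m_{k-1}, m_{k+1}]
```

Each row of the returned matrix sums to 1 (the strength condition), and the membership at a node equals 1 for that node's fuzzy subset (the prototype condition).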
Proposition 4.1 (extended fuzzy partition) [112]: Let A_k, k = 1, ..., c, be a strong
uniform fuzzy partition on D, extended on D'. Then:
1) \forall x \in D, \exists! k_0 \in {1, ..., c-1} such that \forall k \notin
{k_0, k_0 + 1}, \mu_{A_k}(x) = 0, and \mu_{A_{k_0}}(x) + \mu_{A_{k_0+1}}(x) = 1.
2) For k = 1, ..., c, \int_{m_{k-1}}^{m_{k+1}} \mu_{A_k}(x) dx = h.
3) \exists K_A: [-1, 1] \to [0, 1] such that \mu_{A_k}(x) = K_A\left( \frac{x - m_k}{h} \right)
on (m_{k-1}, m_{k+1}), and \int_{-1}^{1} K_A(u) du = 1.
Definition 4.2: The fuzzy histogram density estimator defined on D is given by

f_{hist}(x) = \frac{1}{nh} \sum_{k=1}^{c} Acc(v_k) \times \mu_{B_k}(x),   (4.14)

where {B_k}, k = 1, ..., c, is a strong uniform fuzzy partition defined on D.
It can easily be shown that f_{hist} \ge 0 and \int f_{hist}(x) dx = 1, and that
f_{hist} goes through the c points (m_k, Acc(v_k)/nh).
We now define a new fuzzy partition \Theta_f = {U^f, V}, based on Definition 4.1 and
Proposition 4.1, using \Theta = {U, V}:

u^f_{ki} = \left[ \sum_{l=1}^{c} \frac{\|x_i - v_k\|}{\|x_i - v_l\|} \right]^{-1},   (4.15)

where

v_k = \sum_{i=1}^{n} P(x_i | v_k) \times x_i.   (4.16)

Because {U^f, V} is a strong uniform fuzzy partition [112], using Definition 4.2, the
density at every data point can then be estimated as

dens(x_i) = \sum_{k=1}^{c} Acc(v_k) \times u^f_{ki}.   (4.17)
Fuzzy mountain function amendment
Each time the densest data point is selected, the mountain function of the
other data points must be amended to search for new cluster centers. In the traditional
SC method, the amendment is done using the direct relationships between the selected
data point and its neighborhood, i.e., using a pre-specified mountain radius. In our
approach, this is done using the fuzzy partition, and no other parameters are required.
First, the accumulated density of all cluster centers is revised using

Acc_{t+1}(v_k) = Acc_t(v_k) - M^*_t \times P(v_k | x^*_t),   (4.18)

where x^*_t is the data point selected as the new cluster center and M^*_t is the
mountain function value at time t. The density at each data point is then re-estimated
using (4.17) and (4.18) as follows:
dens_{t+1}(x_i) = \sum_{k=1}^{c} Acc_{t+1}(v_k) \times u_{ki}
              = \sum_{k=1}^{c} \left[ Acc_t(v_k) - M^*_t \times P(v_k | x^*_t) \right] \times u_{ki}
              = \sum_{k=1}^{c} Acc_t(v_k) \times u_{ki} - M^*_t \sum_{k=1}^{c} u_{ki} \times P(v_k | x^*_t)
              = dens_t(x_i) - M^*_t \sum_{k=1}^{c} u_{ki} \times P(v_k | x^*_t).

Hence, the estimated density at each data point is amended as

dens_{t+1}(x_i) = dens_t(x_i) - M^*_t \sum_{k=1}^{c} u_{ki} \times P(v_k | x^*_t).   (4.19)
fzSC: fuzzy subtractive clustering algorithm [146]
Steps:
1) Generate a fuzzy partition A using FCM.
2) Create a strong uniform fuzzy partition B using (4.15).
3) Estimate the density at every x_i \in X using (4.17).
4) Select the densest data point as a cluster candidate.
5) If the stop condition is met, then stop.
6) Re-estimate the density at every x_i using (4.19).
7) Go to Step 4.
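A compact sketch of this loop, with the row-normalized partition used as a stand-in for P(v_k | x_i) and a bounded iteration count added as a safeguard; both are assumptions of this illustration, not details from [146]:

```python
import numpy as np

def fzsc(U, stop_ratio=0.95):
    """fzSC selection loop sketch. U is a (c x n) strong uniform fuzzy
    partition. Repeatedly pick the densest point and amend the densities
    via Eq. (4.19); stop when the mountain value falls below stop_ratio
    of its starting value."""
    U = np.asarray(U, dtype=float)
    acc = U.sum(axis=1)                      # Eq. (4.11): Acc(v_k)
    dens = acc @ U                           # Eq. (4.17): density at each x_i
    P = U / U.sum(axis=1, keepdims=True)     # surrogate for P(v_k | x_i)
    centers = []
    m0 = dens.max()
    for _ in range(U.shape[1]):              # at most n selections
        i_star = int(np.argmax(dens))
        m_t = dens[i_star]
        if centers and m_t < stop_ratio * m0:
            break                            # stop condition on the ratio
        centers.append(i_star)
        dens = dens - m_t * (P[:, i_star] @ U)   # Eq. (4.19) amendment
    return centers
```

On a small partition with one dominant dense region, the loop selects that region's densest point and then stops once the amended mountain value drops below the ratio threshold.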
To determine the optimal number of clusters in the dataset, we first locate the
densest data points. Only the data points where the ratio between the mountain
function values at time t and time 0 is above 0.95, which serves as the stop
condition, are selected. The number of selected points is taken as the optimal number
of clusters in the dataset.
Method testing
To evaluate the ability of fzSC to find the best cluster centers, we ran
fzSC on the ASET1 dataset (Figure 3.13) with six clusters. The result is shown in
Figure 4.5.
We compared fzSC with three standard algorithms, k-means, k-medians, and
fuzzy c-means, on the ASET1 dataset. Each algorithm was run 20 times and the
correctness ratio was averaged. Results are shown in Table 4.5.
Figure 4.5: Candidate cluster centers in the ASET1 dataset found using fzSC. Squares:
cluster centers by FCM; dark circles: cluster centers found by fzSC. Classes are
labeled 1 to 6.
Table 4.5: Algorithm performance on the ASET1 dataset

Algorithm   Correctness ratio by class            Avg. ratio
            1     2     3     4     5     6
fzSC        1.00  1.00  1.00  1.00  1.00  1.00    1.00
k-means     0.97  0.87  1.00  1.00  1.00  0.75    0.93
k-medians   0.95  0.82  1.00  1.00  1.00  0.62    0.90
FCM         0.97  1.00  0.95  1.00  1.00  0.96    0.98
With a correctness ratio of 100% for every class and for the whole dataset, fzSC
always successfully detected the optimal partition of ASET1, while the methods that
initialize the partition matrix randomly did not.
To check the ability of fzSC to detect the number of clusters, we ran fzSC on the
artificial datasets 20 times each and averaged the performance.
Table 4.6: fzSC correctness in determining cluster number on artificial datasets

#clusters generated   Dataset dimension
in the dataset        2     3     4     5
5                     0.97  1.00  1.00  1.00
6                     1.00  0.98  0.90  1.00
7                     1.00  1.00  1.00  1.00
8                     1.00  0.99  0.97  1.00
9                     0.87  0.99  1.00  0.96
Regarding the problem of partition matrix initialization, fzSC provides better
results than the three standard clustering algorithms, which use a random mechanism.
On the artificial datasets, fzSC's performance in detecting the number of clusters is
also strong. It performs worst on datasets with nine clusters in a two-dimensional
data space; plots of some of these datasets show that several clusters largely overlap
and tend to merge into one, so the weaker performance of fzSC there can be attributed
to improperly generated data.
For the real datasets Iris, Wine, Glass, and Breast Cancer Wisconsin, the numbers
of clusters are 3, 3, 6, and 6, respectively. The data in the Wine and Glass datasets
were normalized by attribute before clustering. The Pearson correlation distance,
which is the most appropriate for real datasets, was used to measure the similarity
between data points. We ran fzSC 20 times on each of these datasets with the value of
L, the number of solutions, set to 15. The number of clusters found was compared with
the known number in the dataset. Results are shown in Table 4.7.
Table 4.7: fzSC performance on real datasets

Dataset                   #data points  #clusters  predicted #clusters  correctness ratio
Iris                      150           3          3                    1.00
Wine                      178           3          3                    1.00
Glass                     214           6          6                    0.95
                                                   5                    0.05
Breast Cancer Wisconsin   699           6          6                    0.65
                                                   5                    0.35
We have proposed a novel algorithm to address the problem of parameter
initialization in the standard FCM algorithm [146]. Our method is novel because the
subtractive clustering technique uses a fuzzy partition of the data instead of the
data themselves. The advantages of fzSC are that, unlike traditional SC methods, it
does not require specification of the mountain peak and mountain radii, and, with a
running time of O(cn), compared to O(n^2) for the traditional SC method, it is more
efficient for large datasets. In addition, fzSC can be integrated easily with fuzzy
clustering algorithms to search for the best centers of cluster candidates, or to
address other existing problems of FCM, such as missing data and fuzzifier
determination.
We combined FCM with the new SC method to automatically determine the number of
clusters in a dataset. FCM randomly creates a set of fuzzy partition solutions for the
dataset using a maximum number of clusters; SC then uses the fuzzy partition approach
to search for the best cluster centers and the optimal number of clusters for each
solution. FCM then rapidly determines the best fuzzy partition of every solution using
the optimal number of clusters determined by SC. The best solution from this set is
the final result.
4.2.3 Fuzzy clustering evaluation method
Recently, many cluster validity indices have been developed to evaluate results
of FCM. Bezdek [92] measured performance using partition entropy and the overlap of
adjacent clusters. Fukuyama and Sugeno [93] combined the FCM objective function
with the separation factor, while Xie and Beni [94] integrated the Bezdek index [92]
with the cluster separation factor. Rezaee et al. [95] combined the compactness and
separation factors, and Pakhira et al. [96] combined the same two factors where the
separation factor was normalized. Recently, Rezaee [97] proposed a new cluster index
in which the two factors were normalized across the range of possible numbers of
clusters. These were introduced in Section 3.5.1.
Here, we propose a fuzzy cluster validation method that uses the fuzzy partition
and the matrix of distances between cluster centers and data points. Instead of
compactness and separation, fzBLE uses a Bayesian model and a log-likelihood
estimator. By using both the possibility model and the probability model to represent
the data distribution, fzBLE is appropriate for artificial datasets, where the
distribution follows a standard model, as well as for real datasets, which lack a
standard distribution. We show that fzBLE outperforms popular cluster indices on both
artificial and biological datasets.
Method design: Bayesian and likelihood-based approach
Instead of using the two factors, compactness and separation, we propose a
validation method (fzBLE) that is based on a log-likelihood estimator with a
fuzzy-based Bayesian model.
Given a fuzzy clustering solution modeled by \Theta = {U, V}, where V represents
the cluster centers and U is the partition matrix representing the membership degrees
of the data points in the clusters, the likelihood of the clustering model and the
data is measured as

L(\Theta | X) = L(U, V | X) = \prod_{i=1}^{n} P(x_i | U, V) = \prod_{i=1}^{n} \sum_{k=1}^{c} P(v_k) \times P(x_i | v_k),   (4.20)

where the log-likelihood is computed as

\log(L) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{c} P(v_k) \times P(x_i | v_k) \right) \to \max.   (4.21)
Because our clustering model is possibility-based, a transformation from
possibility to probability is needed to apply Equations (4.20) and (4.21). Given a
fuzzy clustering model \Theta = {U, V}, according to [98], u_{ki} is the possibility
that v_k = x_i. If \Theta is a proper fuzzy partition, then for each k = 1, ..., c
there exists some x° such that u_k(x°) = 1, and u_k is a normal possibility
distribution. Assume P_k is the probability distribution of v_k on X, where
p_{k1} \ge p_{k2} \ge p_{k3} \ge ... \ge p_{kn}. We associate with P_k a possibility
distribution U_k on X [98] such that u_{ki} is the possibility of x_i, where

u_{kn} = n \times p_{kn},
u_{ki} = i (p_{ki} - p_{k,i+1}) + u_{k,i+1},   i = (n-1), ..., 1.   (4.22)
Reversing (4.22), we obtain the transformation of a possibility distribution to a
probability distribution. Assume that U_k is ordered the same way as P_k on X:
u_{k1} \ge u_{k2} \ge u_{k3} \ge ... \ge u_{kn}. Then

p_{kn} = u_{kn} / n,
p_{ki} = p_{k,i+1} + (u_{ki} - u_{k,i+1}) / i.   (4.23)

P_k is an approximate probability distribution of v_k on X, and p_{ki} = P(x_i | v_k).
If U_k is a normal possibility distribution, then \sum_i p_{ki} = 1.
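A direct sketch of the transformation (4.23); sorting the possibilities in descending order and undoing the sort afterwards are implementation details added here:

```python
import numpy as np

def possibility_to_probability(u):
    """Transform a normal possibility distribution u into a probability
    distribution via Eq. (4.23)."""
    u = np.asarray(u, dtype=float)
    order = np.argsort(-u)                 # descending: u_1 >= ... >= u_n
    us = u[order]
    n = len(us)
    p = np.empty(n)
    p[n - 1] = us[n - 1] / n               # p_kn = u_kn / n
    for i in range(n - 2, -1, -1):         # p_ki = p_{k,i+1} + (u_ki - u_{k,i+1}) / i
        p[i] = p[i + 1] + (us[i] - us[i + 1]) / (i + 1)
    out = np.empty(n)
    out[order] = p                         # undo the sort
    return out
```

For a normal possibility distribution (maximum value 1), the resulting probabilities sum to 1 and preserve the ordering of the possibilities.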
The data distributions
Having computed P_k, we can estimate the variance \sigma_k^2, the prior probability
P(v_k), and the normal distribution of v_k for cluster k [27]:

\sigma_k^2 = \sum_{i=1}^{n} P(x_i | v_k) \times \|x_i - v_k\|^2,   (4.24)

P(v_k) = \frac{\sum_{i=1}^{n} P(x_i | v_k)}{\sum_{l=1}^{c} \sum_{i=1}^{n} P(x_i | v_l)},   (4.25)

P_N(x_i | v_k) = \frac{1}{(2\pi)^{p/2} \sigma_k} \exp\left( -\frac{\|x_i - v_k\|^2}{2\sigma_k^2} \right).   (4.26)
In real datasets, the data points in a cluster v_k usually come from different
random distributions. Because they cluster around v_k, they tend to follow the normal
distribution of v_k estimated as in (4.26). This idea is based on the Central Limit
Theorem.
Theorem 4.1 (Central Limit Theorem): If S_n is the sum of n mutually independent
random variables, then the distribution function of S_n is well approximated by a
normal density function, given by the formula

f_{\mu,\sigma}(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}},   (4.27)

where \mu and \sigma are the mean and standard deviation of S_n, respectively. We
integrate the probabilities computed in (4.23) and (4.27) into the probability of the
data point x_i given cluster v_k as

P^*(x_i | v_k) = \max\{ P(x_i | v_k), P_N(x_i | v_k) \}.   (4.28)

Equation (4.28) better represents the data distribution, particularly in real
datasets. Our method, fzBLE [147], is based on (4.21), (4.26), and (4.28).
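A simplified sketch of the resulting score. The row-normalization used here for P(x_i | v_k) is a stand-in for the full possibility-to-probability transform of (4.23), and the isotropic Gaussian follows (4.26); the test data are synthetic:

```python
import numpy as np

def fzble_score(X, V, U):
    """fzBLE-style log-likelihood, Eq. (4.21): priors and Gaussian
    densities estimated from a probability matrix derived from the fuzzy
    partition U (c x n), combined with the max rule of Eq. (4.28)."""
    X, V, U = (np.asarray(a, dtype=float) for a in (X, V, U))
    P = U / U.sum(axis=1, keepdims=True)                  # P[k, i] ~ P(x_i | v_k)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # c x n squared distances
    var = (P * d2).sum(axis=1) + 1e-12                    # Eq. (4.24) variances
    prior = U.sum(axis=1) / U.sum()                       # cluster priors
    p = X.shape[1]
    gauss = np.exp(-d2 / (2.0 * var[:, None])) / ((2.0 * np.pi * var[:, None]) ** (p / 2.0))
    mix = np.maximum(P, gauss)                            # Eq. (4.28) max rule
    return float(np.log((prior[:, None] * mix).sum(axis=0)).sum())   # Eq. (4.21)
```

A well-fitting partition of two separated blobs receives a higher score than a partition whose centers collapse between the blobs, which is the behavior the validation method relies on.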
Method testing
We compared fzBLE with other well-known cluster indices: the partition
coefficient, PC (3.23), and partition entropy, PE (3.24) [92]; Fukuyama-Sugeno, FS
(3.25) [93]; Xie-Beni, XB (3.26) [94]; Compose Within and Between scattering, CWB
(3.28) [95]; the Pakhira-Bandyopadhyay-Maulik cluster index, PBMF (3.29) [96]; and the
Babak Rezaee cluster index, BR (3.30) [97], using the artificial and real datasets.
For each artificial dataset, we ran the standard FCM algorithm five times with m
set to 2.0 and the partition matrix initialized randomly. In each case, the best fuzzy
partition was then selected, and fzBLE and the other cluster indices were run on it to
search for the optimal number of clusters, using numbers from 2 to 12, which was
compared with the known number of clusters. We repeated the experiment 20 times and
averaged the performance of each method. Table 4.8 shows the fraction of correct
predictions. fzBLE and PBMF outperform the other approaches. The CF method, which is
based on the compactness factor, was the least effective.
Table 4.8: Fraction of correct cluster predictions on artificial datasets

#c  fzBLE  PC    PE    FS    XB    CWB   PBMF  BR    CF
3   1.00   0.42  0.42  0.42  0.42  1.00  1.00  0.83  0.00
4   1.00   0.92  0.92  0.92  0.83  1.00  1.00  1.00  0.00
5   1.00   0.75  0.75  0.83  0.75  0.83  1.00  1.00  0.00
6   1.00   0.92  0.83  0.92  0.58  0.58  1.00  0.92  0.00
7   1.00   0.83  0.83  0.83  0.67  0.58  1.00  0.67  0.00
8   1.00   1.00  0.92  1.00  0.92  0.67  1.00  0.83  0.00
9   1.00   0.92  0.67  0.92  0.67  0.33  1.00  0.83  0.00
Table 4.9: Validation method performance on the Iris dataset (3 true clusters)

#c  fzBLE     PC      PE      FS       XB      CWB        PBMF     BR      CF
2   763.0965  0.9554  0.0977  10.6467  0.0203  177.1838   12.3280  1.1910  0.9420
3   762.8034  0.8522  0.2732  9.3369   0.1292  213.4392   17.7131  1.0382  0.3632
4   764.8687  0.7616  0.4381  7.4821   0.2508  613.2656   14.4981  1.1344  0.2665
5   770.2670  0.6930  0.5703  8.2331   0.3473  783.4697   13.6101  1.0465  0.1977
6   773.6223  0.6549  0.6702  7.3202   0.2805  904.3365   12.3695  1.0612  0.1542
7   774.4740  0.6155  0.7530  6.8508   0.2245  1029.7342  11.2850  0.9246  0.1262
8   774.8463  0.6000  0.8111  6.9273   0.3546  1635.3593  10.5320  0.8692  0.1072
9   780.1901  0.5865  0.8556  6.6474   0.3147  1831.5705  9.9357   0.7653  0.0905
10  781.7951  0.5765  0.8991  6.0251   0.2829  2080.3339  9.3580   0.7076  0.0787
Tables 4.9-4.11 show the test results on the Iris, Wine, and Glass datasets. For
the Iris dataset, only fzBLE and PBMF detected the correct number of clusters (Table
4.9). For the Wine dataset, only fzBLE and CWB detected the correct number of
clusters, and for the Glass dataset only fzBLE did (Tables 4.10 and 4.11).
Table 4.10: Validation method performance on the Wine dataset (3 true clusters)

#c  fzBLE     PC      PE      FS        XB      CWB      PBMF    BR      CF
2   926.4540  0.9264  0.1235  113.0951  0.1786  3.9100   1.3996  2.0000  61.1350
3   924.0916  0.8977  0.1764  104.9060  0.2154  3.2981   0.9316  1.4199  39.3986
4   932.8377  0.8607  0.2525  139.9144  0.5295  6.6108   0.6306  1.1983  33.7059
5   929.6146  0.8225  0.3281  126.5746  0.5028  6.9001   0.4700  1.0401  28.4741
6   928.8121  0.8066  0.3669  118.4715  0.6173  9.2558   0.3706  0.9111  25.3451
7   930.6451  0.7988  0.3874  120.3128  0.6465  10.3803  0.2972  0.7629  23.1742
8   932.0462  0.7993  0.3917  124.7999  0.6459  11.0836  0.2471  0.6392  21.4411
9   932.1902  0.7929  0.4120  122.8396  0.6367  11.8373  0.2100  0.5801  19.9154
10  935.0478  0.7909  0.4217  130.9089  0.6270  11.9941  0.1773  0.5252  18.9891
Table 4.11: Validation method performance on the Glass dataset (6 true clusters)

#c  fzBLE      PC      PE      FS      XB      CWB         PBMF    BR      CF
2   1135.6886  0.8884  0.1776  0.3700  0.7222  6538.9311   0.3732  1.9817  0.5782
3   1127.6854  0.8386  0.2747  0.1081  0.7817  4410.3006   0.4821  1.5004  0.4150
4   1119.2457  0.8625  0.2515  0.0630  0.6917  3266.5876   0.4463  1.0455  0.3354
5   1123.2826  0.8577  0.2698  0.1978  0.6450  2878.8912   0.4610  0.8380  0.2818
6   1113.8339  0.8004  0.3865  0.2050  1.4944  5001.1752   0.3400  0.8371  0.2430
7   1116.5724  0.8183  0.3650  0.2834  1.3802  5109.6082   0.3891  0.6914  0.2214
8   1127.2626  0.8190  0.3637  0.3948  1.4904  7172.2250   0.6065  0.5916  0.2108
9   1117.7484  0.8119  0.3925  0.3583  1.7503  8148.7667   0.3225  0.5634  0.1887
10  1122.1585  0.8161  0.3852  0.4214  1.7821  9439.3785   0.3909  0.4926  0.1758
11  1121.9848  0.8259  0.3689  0.4305  1.6260  9826.4211   0.3265  0.4470  0.1704
12  1135.0453  0.8325  0.3555  0.5183  1.4213  11318.4879  0.5317  0.3949  0.1591
13  1138.9462  0.8317  0.3556  0.5816  1.4918  14316.7592  0.6243  0.3544  0.1472
For the Yeast dataset, we ran the FCM algorithm with m set to 1.17 [60] and
used the clustering partition to test all methods as in the previous sections. Table
4.12 shows that only fzBLE detected the correct number of clusters (five) in the
Yeast dataset.
Table 4.12: Validation method performance on the Yeast dataset (5 true clusters)

#c  fzBLE      PC      PE      FS        XB      CWB      PBMF    BR      CF
2   2289.8269  0.9275  0.1172  85.1435   0.2060  8.3660   1.2138  2.0000  133.0734
3   2296.4502  0.9419  0.0983  157.2825  0.2099  4.7637   0.6894  1.0470  94.6589
4   2305.3369  0.9437  0.1000  191.7664  0.2175  4.0639   0.5575  0.7240  74.7629
5   2289.3070  0.9087  0.1648  187.1073  1.0473  13.6838  0.4087  0.6722  65.9119
6   2296.3098  0.8945  0.1939  196.6711  0.9932  13.8624  0.3050  0.6170  60.8480
7   2296.6017  0.8759  0.2299  198.2858  1.0558  15.4911  0.2434  0.5686  56.1525
8   2299.4225  0.8634  0.2526  201.7688  1.0994  16.9644  0.2050  0.5132  51.2865
9   2299.3653  0.8453  0.2871  205.1489  1.2340  20.2532  0.1741  0.4819  48.0737
10  2302.7581  0.8413  0.2992  208.5687  1.1947  20.7818  0.1512  0.4533  45.9442
11  2300.3294  0.8325  0.3186  209.4023  1.1731  21.1525  0.1307  0.4272  43.6600
12  2307.5701  0.8290  0.3272  213.4658  1.2245  23.0389  0.1157  0.4040  42.1594
13  2310.7819  0.8270  0.3354  215.2463  1.3036  25.4062  0.1016  0.3847  40.8654
Table 4.13: Validation method performance on the YeastMIPS dataset (4 true clusters)

#c  fzBLE      PC      PE      FS       XB      CWB      PBMF    BR      CF
2   1316.4936  0.9000  0.1625  25.4302  0.3527  16.7630  0.7155  1.9978  81.0848
3   1317.3751  0.9092  0.1615  32.8476  0.2981  10.1546  0.8032  1.2476  58.2557
4   1304.0374  0.8216  0.3252  39.4858  2.5297  39.8434  0.5400  1.3218  48.6275
5   1308.6776  0.8279  0.3216  54.4979  2.4245  34.9963  0.3620  0.9558  41.9671
6   1309.9191  0.8211  0.3460  59.8918  2.3511  35.4533  0.2691  0.8291  38.5468
7   1315.3692  0.8139  0.3654  65.4866  2.3562  38.8797  0.2423  0.7252  36.0906
8   1315.1479  0.8062  0.3918  67.6774  2.4958  43.9502  0.1966  0.6712  34.1387
9   1321.2280  0.8109  0.3874  72.3197  2.2854  41.2112  0.1664  0.6072  32.3289
10  1324.1578  0.8158  0.3847  74.7867  2.0433  37.6154  0.1395  0.5588  30.9686
For the YeastMIPS dataset, we ran the FCM algorithm using the same
parameters as with the Yeast dataset. The results in Table 4.13 show that only fzBLE
correctly detected the four clusters in the YeastMIPS dataset.
For the RCNS (Rat Central Nervous System) dataset, we ran fzBLE and the
other cluster indices on the dataset clustering partition found by the standard FCM
algorithm using the Euclidean metric for distance measurement. Table 4.14 shows that,
again, only fzBLE detected the correct number of clusters.
Table 4.14: Validation method performance on the RCNS dataset (6 true clusters)

#c  fzBLE     PC      PE      FS        XB      CWB      PBMF    BR      CF
2   580.0728  0.9942  0.0121  568.7972  0.0594  5.5107   4.2087  1.1107  177.8094
3   564.1986  0.9430  0.0942  487.6104  0.4877  4.1309   4.2839  1.6634  117.9632
4   561.0169  0.9142  0.1470  430.4863  0.9245  6.1224   3.3723  1.3184  99.1409
5   561.7420  0.8900  0.1941  397.0935  1.3006  9.4770   2.6071  1.1669  88.5963
6   552.9153  0.8695  0.2387  300.6564  2.5231  20.6496  1.9499  1.1026  84.0905
7   556.2905  0.8707  0.2386  468.3121  2.1422  21.0187  2.8692  0.7875  57.5159
8   555.3507  0.8925  0.2078  462.0673  1.7245  20.0113  2.5323  0.5894  52.0348
9   558.8686  0.8863  0.2192  512.4278  1.6208  22.4772  2.6041  0.5019  45.9214
10  565.8360  0.8847  0.2241  644.1451  1.1897  21.9932  3.4949  0.3918  33.1378
We have presented a novel method that uses a log-likelihood estimator with a
Bayesian model and the possibility, rather than the probability, distribution model of
the dataset derived from the fuzzy partition. By using the Central Limit Theorem,
fzBLE effectively represents the distributions found in real datasets. The results
show that fzBLE performs effectively on both artificial and real datasets.
4.2.4 Fuzzy clustering evaluation using Gene Ontology [180]
The shortcoming of fzBLE is that it is based solely on the data and is therefore
unable to incorporate prior biological knowledge into cluster evaluation. We developed
GO-fzBLE, a cluster validation method that generates a fuzzy partition for a given
clustering solution using GO term-based semantic similarity and applies the fzBLE
method to evaluate the clustering solution using that fuzzy partition.
Gene Ontology
The Gene Ontology (GO) [181] is a hierarchy of biological terms using a
controlled vocabulary that includes three independent ontologies for biological
process (BP), molecular function (MF), and cellular component (CC). Standardized terms
known as GO terms describe the roles of genes and gene products in any organism. GO
terms are related to each other through parent-child relationships. A gene product
can have one or more molecular functions, can participate in one or more biological
processes, and can be associated with one or more cellular components [182]. As a way
to share knowledge about gene function, GO itself does not contain the gene products
of any organism; rather, expert curators specializing in different organisms annotate
the biological roles of gene products using GO annotations. Each GO annotation is
assigned an evidence code that indicates the type of evidence supporting the
annotation.
Semantic similarity
GO is structured as directed acyclic graphs (DAGs) in which the terms form
nodes and the two kinds of semantic relationships, "is-a" and "part-of", form edges
[184]. "is-a" is a simple class-subclass relation, where A is-a B means that A is a
subclass of B. "part-of" is a partial ownership relation; C part-of D means that
whenever C is present, it is always a part of D, but C need not always be present.
The DAG structure allows a metric to be assigned to a set of terms based on the
likeness of their meaning, which is used to measure semantic similarity between
terms.
Multiple GO-based semantic similarity measures have been developed [183, 184] and are
increasingly used to evaluate the relationships between proteins in protein-protein
interactions, or between co-regulated genes in gene expression data analysis. Among
the existing semantic similarity measures, that of Resnik has been considered
appropriate for gene expression data analysis because it correlates strongly with
gene sequence similarities and gene expression profiles [183]. However, Wang et al.
[184] showed that Resnik's method has a drawback in that it uses only the information
content derived from annotation statistics, which is not suitable for measuring the
semantic similarity of GO terms. We therefore propose to use Wang's method for GO
semantic similarity measurement. For each term A, a semantic value, S(A), is computed
as
S(A) = Σ_{t ∈ T_A} S_A(t),    (4.29)

where T_A is the set of terms comprising the term A and its ancestors, and S_A(·) is the
semantic value with respect to the term A, defined as in (4.30):

S_A(t) = 1,                                            if t = A;
S_A(t) = max{ w_e × S_A(u) : u ∈ ChildrenOf(t) },      if t ≠ A,    (4.30)

where w_e is the semantic contribution factor [184] for the edge connecting term t with its
child term u. The semantic similarity between two terms, A and B, is defined as:

S_GO(A, B) = Σ_{t ∈ T_A ∩ T_B} (S_A(t) + S_B(t)) / (SV(A) + SV(B)),    (4.31)

where SV(A) is the semantic value of A as in (4.29). Using the semantic similarities
between GO terms, the semantic similarity between two GO term sets, G1 and G2, is defined as:

Sim(G1, G2) = ( Σ_{g1 ∈ G1} Sim(g1, G2) + Σ_{g2 ∈ G2} Sim(g2, G1) ) / ( |G1| + |G2| ),    (4.32)

where Sim(t, T) is the similarity between the term t and the term set T, defined as in (4.33):

Sim(t, T) = max_{u ∈ T} { S_GO(t, u) }.    (4.33)
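The computation in Eqs. (4.29)-(4.33) can be sketched on a toy ontology. This is a minimal illustration, not the thesis implementation: it uses a single contribution factor w for every edge (Wang et al. use per-relation factors, commonly cited as 0.8 for "is-a" and 0.6 for "part-of"), and the term names are made up.

```python
# Sketch of Wang's GO semantic similarity (Eqs. 4.29-4.33).
# Assumption: one contribution factor w for all edges of the toy DAG.

def s_values(term, parents, w=0.8):
    """S_A(t) for term A and all of its ancestors (Eq. 4.30)."""
    s = {term: 1.0}
    queue = [term]
    while queue:
        t = queue.pop(0)
        for p in parents.get(t, []):
            cand = w * s[t]            # contribution flows child -> parent
            if cand > s.get(p, 0.0):   # keep the max over paths
                s[p] = cand
                queue.append(p)
    return s

def sim_go(a, b, parents, w=0.8):
    """S_GO(A, B): shared S-values over the two semantic values (Eq. 4.31)."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    common = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in common) / (sum(sa.values()) + sum(sb.values()))

def sim_sets(g1, g2, parents, w=0.8):
    """Sim(G1, G2) between two GO term sets (Eqs. 4.32-4.33)."""
    best = lambda t, ts: max(sim_go(t, u, parents, w) for u in ts)
    return (sum(best(t, g2) for t in g1) +
            sum(best(t, g1) for t in g2)) / (len(g1) + len(g2))

# Toy ontology: root <- bp <- {t1, t2}
parents = {"bp": ["root"], "t1": ["bp"], "t2": ["bp"]}
print(sim_go("t1", "t1", parents))          # identical terms -> 1.0
print(round(sim_go("t1", "t2", parents), 3))  # siblings share bp and root
```

Siblings t1 and t2 share the ancestors bp and root, so their similarity reflects how much of each term's total semantic value lies in that shared part of the DAG.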
Method design
For a given crisp clustering solution of gene expression data, an approximate
fuzzy partition is generated based on the gene-gene semantic similarities using GO term
annotations. The fuzzy partition is applied to the fzBLE method for evaluation of the
clustering solution. Each crisp clustering solution is modeled with C = {M, V}, where
V represents the cluster centers and M = {m_ki}, m_ki ∈ {0,1}, k=1,...,c, i=1,...,n, is the
crisp c-partition matrix representing the membership of the data points in the clusters. In
the case where a fuzzy clustering solution Θ = {U, V} is given, C can be computed
using our defuzzification method for fuzzy partitions [154].

Fuzzification of a crisp clustering partition. In order to use GO terms to validate a
crisp clustering solution, a fuzzy partition is generated from the crisp partition using
GO semantic similarity. This is done by mapping the crisp partition into the space of
GO terms. Each data object x_i, i=1,...,n, corresponds to a vector of GO
annotations, x_i^GO, and a vector of degrees of belief (DOB), x_i^CF, where x_it^CF is the DOB
of the term x_it^GO annotated to x_i (Table 4.15). Because the GO annotations of a gene or gene
product may come from different sources (Table 4.16), the use of DOBs helps
combine multiple annotations that use the same term for the same data object.
Table 4.15: Degrees of belief of GO annotation evidence codes

Evidence                     Degree of belief
EXP                          1.0
IDA, IPI, TAS                0.9
IMP, IGI, IEP                0.7
ISS, ISO, ISA, ISM           0.4
IGC                          0.2
IBA, IBD, IKR, IRD, RCA      0.3
NAS, IC, ND, NR              0.0
IEA                          0.1
The GO annotations of each cluster v_k, v_k^GO, k=1,...,c, are determined from
the GO annotations of the members of cluster v_k:

v_k^GO = ∪_{x ∈ v_k} x^GO.    (4.34)

For each annotation v_kt^GO of v_k, t=1,...,|v_k^GO|, its DOB, v_kt^CF, is computed using the
DOBs of the cluster's members annotated with term t:

v_kt^CF = mean( x_i^CF : x_i ∈ v_k, x_i^GO = v_kt^GO ).    (4.35)
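Eqs. (4.34)-(4.35) can be sketched as follows: the cluster's annotation set is the union of its members' annotations, and each term's DOB is the mean over the members carrying that term. The evidence-code weights are a subset of Table 4.15; the gene names and annotations are invented for illustration.

```python
# Sketch of Eqs. (4.34)-(4.35): cluster GO annotations and their DOBs.
# DOB values follow Table 4.15; the example genes are hypothetical.

DOB = {"EXP": 1.0, "IDA": 0.9, "TAS": 0.9, "IMP": 0.7, "ISS": 0.4, "IEA": 0.1}

def cluster_annotation(members):
    """members: {gene: [(go_term, evidence_code), ...]} -> {go_term: mean DOB}."""
    sums, counts = {}, {}
    for annots in members.values():
        for term, code in annots:
            sums[term] = sums.get(term, 0.0) + DOB[code]
            counts[term] = counts.get(term, 0) + 1
    # v_kt^CF = mean of member DOBs for term t (Eq. 4.35)
    return {t: sums[t] / counts[t] for t in sums}

cluster = {
    "gene1": [("GO:0006915", "EXP"), ("GO:0008283", "IEA")],
    "gene2": [("GO:0006915", "ISS")],
}
print(cluster_annotation(cluster))
# GO:0006915 gets mean(1.0, 0.4) = 0.7; GO:0008283 keeps 0.1
```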
Table 4.16: Gene Ontology evidence codes

Experimental Evidence Codes
  EXP   Inferred from Experiment
  IDA   Inferred from Direct Assay
  IPI   Inferred from Physical Interaction
  IMP   Inferred from Mutant Phenotype
  IGI   Inferred from Genetic Interaction
  IEP   Inferred from Expression Pattern
Computational Analysis Evidence Codes
  ISS   Inferred from Sequence or Structural Similarity
  ISO   Inferred from Sequence Orthology
  ISA   Inferred from Sequence Alignment
  ISM   Inferred from Sequence Model
  IGC   Inferred from Genomic Context
  IBA   Inferred from Biological aspect of Ancestor
  IBD   Inferred from Biological aspect of Descendant
  IKR   Inferred from Key Residues
  IRD   Inferred from Rapid Divergence
  RCA   Inferred from Reviewed Computational Analysis
Author Statement Evidence Codes
  TAS   Traceable Author Statement
  NAS   Non-traceable Author Statement
Curator Statement Evidence Codes
  IC    Inferred by Curator
  ND    No biological Data available
Automatically-assigned Evidence Codes
  IEA   Inferred from Electronic Annotation
Obsolete Evidence Codes
  NR    Not Recorded
We apply the semantic similarity measure in (4.32) to compute the semantic
similarity between cluster v_k and data point x_i using their GO annotations. We propose a
modification of (4.31) so that it can be used with DOBs. For the term v_kt^GO of cluster v_k
and the term x_ij^GO of data point x_i, their semantic similarity is defined as:

S'_GO(v_kt^GO, x_ij^GO) = S_GO(v_kt^GO, x_ij^GO) × min(v_kt^CF, x_ij^CF).    (4.36)

We use (4.36) instead of (4.31) when computing the semantic similarity between v_k and
x_i, Sim'(v_k, x_i), using (4.32). The distance between v_k and x_i based on GO semantic
similarity is then defined as:

d_GO^2(v_k, x_i) = d^2(1, Sim'(v_k, x_i)).    (4.37)

A fuzzy partition model in the GO term space, Θ_GO = {U_GO, ·}, of C is now
determined, where U_GO is computed as in (4.9) using the distance function defined in
(4.37). Θ_GO is used with fzBLE for the clustering solution evaluation. Because Θ_GO is
based on GO annotations, we propose a method to compute the prior probability of v_k,
P_p(v_k), k=1,...,c, using the prior probabilities of the GO terms used by v_k, T. For a
given term t, t ∈ T,

P(t|v_k) = v_kt^CF / Σ_{u=1}^{|v_k^GO|} v_ku^CF,    (4.39)

where P_p(t) and P(t|v_k) are the prior probability of the term t and the conditional
probability of the term t given v_k, respectively. The prior probability of v_k, k=1,...,c, is
computed as in (4.40):

P_p(v_k) = Σ_{t ∈ T} P(t, v_k) × P_p(t) = Σ_{t ∈ T} P(t|v_k) × P(v_k) × P_p(t).    (4.40)
Method testing
We tested GOfzBLE against fzBLE and the other cluster validity indices, as in
Section 4.2.3, using the Yeast, YeastMIPS and RCNS datasets.
Table 4.17: Validation method performance on the Yeast dataset using GO-BP

#c   GOfzBLE    PC     PE     FS       XB     CWB     PBMF   BR     fzBLE
2    4832.656   0.928  0.117  85.144   0.206  8.366   1.214  2.000  2289.827
3    4832.542   0.942  0.098  157.283  0.210  4.764   0.689  1.047  2296.450
4    4832.751   0.944  0.100  191.766  0.218  4.064   0.558  0.724  2305.337
5    4831.548   0.909  0.165  187.107  1.047  13.684  0.409  0.672  2289.307
6    4831.890   0.895  0.194  196.671  0.993  13.862  0.305  0.617  2296.310
7    4833.182   0.876  0.230  198.286  1.056  15.491  0.243  0.569  2296.602
8    4833.644   0.863  0.253  201.769  1.099  16.964  0.205  0.513  2299.423
9    4832.358   0.845  0.287  205.149  1.234  20.253  0.174  0.482  2299.365
10   4832.246   0.841  0.299  208.569  1.195  20.782  0.151  0.453  2302.758
11   4832.600   0.833  0.319  209.402  1.173  21.153  0.131  0.427  2300.329
12   4832.668   0.829  0.327  213.466  1.225  23.039  0.116  0.404  2307.570
13   4832.217   0.827  0.335  215.246  1.304  25.406  0.102  0.385  2310.782
For the Yeast dataset, only GOfzBLE and fzBLE correctly identified the number
of clusters (Table 4.17). Table 4.18 shows that these two methods again detected the
correct number of clusters. While fzBLE works on gene expression levels, GOfzBLE
works on GO-BP annotations. The results also show that GO-BP annotations are
strongly correlated with gene-gene differential co-expression patterns. In other words,
we may use either gene-gene co-expression patterns in gene expression data or GO-BP
annotations to search for genes with similar functionality.
Table 4.18: Validation method performance on the YeastMIPS dataset using GO-BP

#c   GOfzBLE    PC     PE     FS      XB     CWB     PBMF   BR     fzBLE
2    2288.413   0.900  0.163  25.430  0.353  16.763  0.716  1.998  1316.494
3    2286.843   0.909  0.162  32.848  0.298  10.155  0.803  1.248  1317.375
4    2283.854   0.822  0.325  39.486  2.530  39.843  0.540  1.322  1304.037
5    2285.069   0.828  0.322  54.498  2.425  34.996  0.362  0.956  1308.678
6    2286.252   0.821  0.346  59.892  2.351  35.453  0.269  0.829  1309.919
7    2286.834   0.814  0.365  65.487  2.356  38.880  0.242  0.725  1315.369
8    2287.543   0.806  0.392  67.677  2.496  43.950  0.197  0.671  1315.148
9    2288.333   0.811  0.387  72.320  2.285  41.211  0.166  0.607  1321.228
10   2288.954   0.816  0.385  74.787  2.043  37.615  0.140  0.559  1324.158
Table 4.19: Validation method performance on the RCNS dataset using GO-CC

#c   GOfzBLE   PC     PE     FS       XB     CWB     PBMF   BR     fzBLE
2    970.340   0.994  0.012  568.797  0.059  5.511   4.209  1.111  580.073
3    970.460   0.943  0.094  487.610  0.488  4.131   4.284  1.663  564.199
4    969.909   0.914  0.147  430.486  0.925  6.122   3.372  1.318  561.017
5    969.669   0.890  0.194  397.094  1.301  9.477   2.607  1.167  561.742
6    969.659   0.870  0.239  300.656  2.523  20.650  1.950  1.103  552.915
7    970.148   0.871  0.239  468.312  2.142  21.019  2.869  0.788  556.291
8    969.787   0.893  0.208  462.067  1.725  20.011  2.532  0.589  555.351
9    970.300   0.886  0.219  512.428  1.621  22.477  2.604  0.502  558.869
10   970.487   0.885  0.224  644.145  1.190  21.993  3.495  0.392  565.836
The RCNS dataset has six groups of similarly expressed genes, two of which are
invariant [87]. We first ran GOfzBLE using GO-CC annotations and compared it with the
other methods. The results are shown in Table 4.19. Only GOfzBLE and fzBLE
identified six clusters in the dataset, corresponding to the six groups in the analysis
performed by Wen et al. [87]. We then reran GOfzBLE using GO-BP annotations and
compared it with the other methods. The results are shown in Table 4.20. GOfzBLE
identified four clusters in the dataset, corresponding to the four stages of the
developmental process. This result again shows that GO-BP annotations are strongly
correlated with gene-gene differential co-expression patterns. It also shows that GO-BP
annotations are useful in creating effective external criteria for cluster analysis of gene
expression data.
Table 4.20: Validation method performance on the RCNS dataset using GO-BP

#c   GOfzBLE    PC     PE     FS       XB     CWB     PBMF   BR     fzBLE
2    1373.999   0.994  0.012  568.797  0.059  5.511   4.209  1.111  580.073
3    1373.935   0.943  0.094  487.610  0.488  4.131   4.284  1.663  564.199
4    1373.776   0.914  0.147  430.486  0.925  6.122   3.372  1.318  561.017
5    1374.208   0.890  0.194  397.094  1.301  9.477   2.607  1.167  561.742
6    1374.496   0.870  0.239  300.656  2.523  20.650  1.950  1.103  552.915
7    1374.526   0.871  0.239  468.312  2.142  21.019  2.869  0.788  556.291
8    1374.811   0.893  0.208  462.067  1.725  20.011  2.532  0.589  555.351
9    1375.199   0.886  0.219  512.428  1.621  22.477  2.604  0.502  558.869
10   1375.426   0.885  0.224  644.145  1.190  21.993  3.495  0.392  565.836
4.2.5 Imputation methods for partially missing data
As discussed earlier in Section 3.4, given a data point x_i with missing attributes,
the global-model-based approach updates the missing values of x_i using information
from the whole dataset, whereas the local-model-based approach uses only the
information from the data points in the neighborhood of x_i. The former has the advantage of
eliminating the noise that can be caused by the data points around x_i, particularly
when x_i lies in an overlapping region. The latter, however, has the advantage that it
may provide the distribution model that x_i follows.

Recent research has shown that the global-model-based approach, of which the optimal
completion strategy (OCS) is an example, outperforms the local-model-based one [139,
140, 143]. Existing approaches usually update X_M using either a global model or a local
model. Luo et al. [141], who integrated a local update model into the FCM
algorithm, showed that their method outperformed two well-known methods: K-nearest
neighbors (KNN) and local least squares K-nearest neighbors (SKNN).
However, the method of Luo et al. is based only on local model information and hence
may fail in overlapping regions. Mohammadi et al. [142] improved the work of Luo
et al. by using an external distance based on GO terms. However, because the relationships
among GO terms are in fact detected locally, methods using them remain a kind of local model.

To address the problem of missing data, we developed two new
algorithms based on our fzBLE and fzSC methods, namely the probability-based (fzPBI) and
density-based (fzDBI) imputation algorithms, which create an "abstract" fuzzy partition
from the partition derived by the partial distance strategy (PDS) method [140]. These
abstract partitions not only contain the global distribution model of the data, but
also represent the local distributions at every data point in the dataset. We then
developed new update models for X_M using both the global and local data distribution
models derived from the abstract partitions [154, 155].
4.2.6 Probability-based imputation method (fzPBI) [154]
Method design
The objective of fzPBI is to cluster a possibly incomplete dataset X into c
clusters; the missing values X_M are estimated during the clustering process with respect
to the optimization of the objective function J_m. Let X_W = {x_i ∈ X | x_i is a complete
data object}, X_P = {x_ij | x_ij ≠ ?}, and X_M = {x_ij | x_ij = ?}. The distance measure,
d(·), is therefore defined as:

d^2(x_i, v_k) = Σ_{j=1}^{p} w_ij × (x_ij − v_kj)^2,    (4.41)

where w_ij indicates the contribution degree of the j-th attribute of x_i, x_ij, to the distance
between x_i and v_k. If x_ij ∈ X_M, w_ij increases with each iteration to avoid premature use of
estimated values. Therefore, we define w_ij as:

w_ij = 1,     if x_ij ∈ X_P;
w_ij = t/T,   if x_ij ∈ X_M,    (4.42)

where t is the iteration index and T is the number of iterations, 0 ≤ t ≤ T.
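The weighted partial distance of Eqs. (4.41)-(4.42) can be sketched as below. This is a minimal illustration with made-up values: a boolean mask marks the attributes that are missing and currently estimated, and the weight t/T phases those attributes in as the iterations progress.

```python
# Sketch of the weighted partial distance (Eqs. 4.41-4.42).
# missing_mask[j] is True when attribute j of x is an estimated (missing) value.

def partial_dist2(x, v, missing_mask, t, T):
    """Squared distance between data point x and cluster center v."""
    d2 = 0.0
    for xj, vj, is_missing in zip(x, v, missing_mask):
        w = (t / T) if is_missing else 1.0   # Eq. (4.42)
        d2 += w * (xj - vj) ** 2
    return d2

x = [1.0, 3.0]            # attribute 0 is a freshly estimated missing value
mask = [True, False]
v = [0.0, 0.0]
print(partial_dist2(x, v, mask, t=0, T=10))   # early: missing dim ignored -> 9.0
print(partial_dist2(x, v, mask, t=10, T=10))  # late: full distance -> 10.0
```

Early in the run the estimated attribute contributes nothing; by the final iteration it counts fully, matching the intent of avoiding premature use of estimated values.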
To impute the values of X_M, we apply our fzBLE method [148] to generate a
probabilistic model of the data distribution using the fuzzy partition. This model is then
used to impute the missing values. For each cluster v_k, k=1,...,c, a probability
distribution {p_ki}, i=1,...,n, is derived from the possibility distribution {u_ki}, i=1,...,n, as in
(4.22) and (4.23). The data distribution model is then estimated using (4.24), (4.25)
and (4.26). The model for missing data estimation is then developed as follows:

Let P_k^m = {p_k1^m, p_k2^m, ..., p_kp^m} be a set of probabilities at v_k, where p_kj^m, j=1,...,p,
indicates the probability associated with the missing status of attribute j in cluster k, defined as:

p_kj^m = Σ_{i=1}^{n} P(x_i, v_k) × I_ij / Σ_{i=1}^{n} P(x_i, v_k),    (4.43)

where

I_ij = 1,   if x_ij ∈ X_P;
I_ij = 0,   if x_ij ∈ X_M.

Hence, the probability that a data object x_i has a missing attribute j in cluster k is:

P(x_M(ij)k) = P(v_k) × p_kj^m × Π_{t=1, t≠j}^{p} p_kt^m × P(x_i, v_k).    (4.44)

Because each cluster v_k can be considered a component of the data distribution
model of X, the estimated value of x_ij ∈ X_M is computed as:

x̂_ij = Σ_{k=1}^{c} P(x_M(ij)k) × v_kj / Σ_{k=1}^{c} P(x_M(ij)k).    (4.45)

To avoid oscillation in the missing value imputation, estimated values are
used decreasingly. In contrast to their usage in the clustering process, their contribution
degree in the distance measurement during the estimation process is defined as:

w_ij = 1,         if x_ij ∈ X_P;
w_ij = 1 − t/T,   if x_ij ∈ X_M.    (4.46)
fzPBI algorithm
Steps:
1) t = 0; initialize U^t randomly with respect to (3.2).
2) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, as in (4.41) and (4.42).
3) Compute U^{t+1} using (3.3).
4) Compute V^{t+1} using (3.4).
5) If (t > T) or (||U^{t+1} − U^t|| < ε) then Stop.
6) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, using w_ij as in (4.46).
7) Create a probabilistic data distribution model from the fuzzy partition using
(4.24), (4.25) and (4.26).
8) Create a probabilistic model for X_M using (4.43) and (4.44).
9) Estimate X_M using (4.45). t = t + 1. Go to Step 2.
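The imputation step of the loop (Step 9, Eq. 4.45) can be sketched in isolation: each missing entry is replaced by a weighted average of the cluster centers. In this sketch the per-cluster weights, which in fzPBI come from the probabilistic model of Eq. (4.44), are simply passed in; all values are illustrative.

```python
# Sketch of the fzPBI imputation step (Eq. 4.45): a missing entry x_ij is a
# probability-weighted average of the centers' j-th coordinates.
# weights[k] stands in for P(x_M(ij)k), computed elsewhere via Eq. (4.44).

def impute(weights, centers, j):
    """Weighted average of centers[k][j] with the given per-cluster weights."""
    num = sum(w * c[j] for w, c in zip(weights, centers))
    return num / sum(weights)

centers = [[0.0, 2.0], [0.0, 6.0]]           # two cluster centers, 2 attributes
print(impute([0.75, 0.25], centers, j=1))    # 0.75*2 + 0.25*6 = 3.0
```

The estimate is pulled toward the center of whichever cluster the probabilistic model considers more likely for that point.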
The difference between fzPBI and the standard FCM algorithm is that fzPBI
provides a method to discover the probabilistic model of the data distribution from the
fuzzy partition and to apply the model to missing value imputation. fzPBI can therefore
address the problem of missing data by itself.
Algorithm performance measures
We used two measures to evaluate algorithm performance. The first is
the root mean square error (RMSE) between the true values and the imputed values,
defined as:

RMSE = sqrt( (1/n_I) × Σ (x_ij − x̂_ij)^2 ),    (4.47)

where n_I is the number of missing values imputed and the sum runs over the imputed
entries. The second measure assesses overall performance by the number of data objects
with missing attributes that were misclassified. This assessment is done by comparing
the cluster label of each data object with its actual class label; if the two match, there is
no misclassification, otherwise a misclassification has occurred, as defined in (3.21).
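The RMSE measure of Eq. (4.47) can be sketched directly; `true` and `imputed` are parallel lists holding only the entries that were missing and then imputed.

```python
# Sketch of the RMSE performance measure (Eq. 4.47), computed over the
# imputed entries only.
import math

def rmse(true, imputed):
    n_i = len(true)                      # number of imputed missing values
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true, imputed)) / n_i)

print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 3))  # sqrt(4/3) ~ 1.155
```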
Method testing

[Figure 4-6: Average RMSE of 50 trials using an incomplete ASET2 dataset with
different percentages of missing values]
To evaluate the performance of fzPBI, we compared it with six popular
imputation methods: PDS, OCS and NPS [140]; FCMimp [158]; CIAO [159]; and
FCMGOimp [160]. For each test dataset, we generated missing data with different
percentages of missing values. The fuzzifier factor, m, was set to 2.0 for the
artificial datasets and the Iris, Wine and RCNS datasets. For the Yeast, YeastMIPS and
Serum datasets, m was set to 1.17 and 1.25, respectively, as in [60]. The number of
clusters, c, was set to the known number of clusters. Each algorithm was run 5 times
and the best result recorded. We repeated the experiment 50 times and averaged the
performance of each algorithm using both measures.
[Figure 4-7: Average RMSE of 50 trials using an incomplete ASET5 dataset with
different percentages of missing values]
Regarding the artificial datasets ASET2 and ASET5: ASET2 has five well-separated
clusters of similar size, while ASET5 is more complex, containing three clusters
that differ in size and density (Figure 3-16). On ASET2, fzPBI had the lowest RMSEs
across different scales of missing data (Figure 4-6) and therefore performed better than
the other methods. On ASET5, fzPBI again performed best (Figure 4-7), although the
performance of the other methods was comparable.

Applied to the Iris dataset, fzPBI and CIAO had the smallest RMSEs, although,
as shown in Figure 4-8, fzPBI performed marginally better. Table 4.21 shows that,
compared with CIAO, fzPBI had a smaller number of misclassified objects across
different scales of missing values.
[Figure 4-8: Average RMSE of 50 trials using an incomplete Iris dataset with different
percentages of missing values]
On the Wine dataset, on average, both fzPBI and CIAO outperformed the other
methods. fzPBI and CIAO performed equally on data with missing value percentages
of 10% and lower, but fzPBI performed slightly better than CIAO at higher
percentages. FCMimp and FCMGOimp performed better than the other methods when
the percentage of missing values was small (Figure 4-9), but worse than all other
methods on datasets with high percentages, i.e., greater than 15%, of missing values.
Table 4.21: Average results of 50 trials using an incomplete Iris dataset with different
percentages (%) of missing values

Average number of objects misclassified
%    fzPBI  PDS   OCS   NPS   FCMimp  CIAO  FCMGOimp
5    15.9   18.7  16.0  16.2  18.4    15.9  18.0
7    15.9   12.6  16.3  16.8  13.0    15.9  13.2
10   15.9   10.0  16.3  17.7  9.8     16.0  9.5
13   16.0   9.1   17.1  19.0  9.5     16.2  9.0
15   16.1   12.8  18.4  20.7  12.2    16.3  9.9
20   15.9   20.6  19.8  22.9  20.3    16.0  23.7
25   16.2   31.9  20.5  24.8  30.8    16.1  32.8
30   16.7   37.9  26.2  30.9  37.9    16.7  38.9
40   16.7   49.3  30.8  37.9  50.8    17.0  57.8
50   21.2   56.5  42.6  52.3  57.1    21.8  64.1
[Figure 4-9: Average RMSE of 50 trials using an incomplete Wine dataset with different
percentages of missing values]
[Figure 4-10: Average RMSE of 50 trials using an incomplete RCNS dataset with
different percentages of missing values]
On the RCNS dataset (Figure 4-10), fzPBI outperformed the other algorithms when
the missing value percentage was 40% or lower, but performed marginally worse at a
missing value percentage of 50%. This is because the RCNS dataset contains only 112
data points and becomes very sparse at high percentages of missing values; the data
probability model could not be properly determined, and fzPBI therefore could not
perform correctly.
[Figure 4-11: Average RMSE of 50 trials using an incomplete Yeast dataset with
different percentages of missing values]
fzPBI outperformed the other methods on the Yeast and YeastMIPS datasets
(Figures 4-11 and 4-12); FCMGOimp was run using the GO term-based distance
measure [11], yet it outperformed only FCMimp. This result is similar to that reported
in [158]. Using GO terms to measure the distance between genes is interesting;
however, a crisp distance measure in {0, 1}, where 0 is the distance between a pair of
genes having at least one GO term in common, does not help much with the problem of
missing value imputation.

fzPBI again outperformed the other algorithms on the Serum dataset for missing
value percentages lower than 50% (Figure 4-13); in the remaining cases, fzPBI was
outperformed only by CIAO. On average, fzPBI performed best on the Serum dataset.
[Figure 4-12: Average RMSE of 50 trials using an incomplete YeastMIPS dataset with
different percentages of missing values]
[Figure 4-13: Average RMSE of 50 trials using an incomplete Serum dataset with
different percentages of missing values]
4.2.7 Density-based imputation method (fzDBI) [155]
Similar to fzPBI, the fzDBI algorithm was developed to cluster a possibly
incomplete dataset X into c clusters based on a fuzzy partition generated using the
partial distance measure as in (4.41) and (4.42). The abstract fuzzy partition is derived
from the FCM fuzzy partition using our fzSC method, as follows:

Let Θ = {U, V} be the fuzzy partition generated by FCM, which is a distance-based
fuzzy partition describing the data density at the cluster centers. The accumulated
density at the center of cluster v_k, k=1,...,c, is calculated as in (4.11). However, Θ
cannot properly describe the data density at every data point of X (Section 4.2.2). A strong
uniform fuzzy partition is therefore defined as in (4.15) and (4.16), and the data density
at every data point is then computed using (4.17). A set V̄ = {v̄_1, v̄_2, ..., v̄_Cd} of the
Cd most dense data points is then selected as cluster candidates. Because V̄ represents
the data densities in the dataset, we construct a density-based fuzzy partition
Θ̄ = {Ū, V̄}, where Ū is defined as:

ū_ki = Σ_{l=1}^{Cd} P(v̄_l, v_k) × u_li / Σ_{l=1}^{Cd} P(v̄_l, v_k).    (4.48)

The estimated value of x_ij ∈ X_M is computed based on both the FCM fuzzy
partition, Θ, and the density-based fuzzy partition, Θ̄:

x̂_ij = α × Σ_{k=1}^{c} u_ki^m × v_kj / Σ_{k=1}^{c} u_ki^m
      + (1 − α) × Σ_{k=1}^{Cd} ū_ki^m × v̄_kj / Σ_{k=1}^{Cd} ū_ki^m,    (4.49)

where α, 0 < α < 1, indicates the contribution level of each of the two fuzzy partition
models, Θ and Θ̄, in the missing value imputation. We set α = 0.5 so that both
models contribute equally. To avoid oscillation in the missing value imputation,
estimated values are used decreasingly. In contrast to their usage in the clustering
process, their contribution degree in the distance measurement during the estimation
process is defined as in (4.46).
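The blended update of Eq. (4.49) can be sketched as below. This is an illustration only: the membership columns, centers, and candidate points are made up, and both partitions contribute through the same fuzzy-weighted-average form with α = 0.5.

```python
# Sketch of the fzDBI update (Eq. 4.49): a missing value is estimated from both
# the distance-based partition (U, V) and the density-based partition (U~, V~),
# blended by alpha. All numeric values here are illustrative.

def fcm_estimate(u_col, centers, j, m=2.0):
    """Fuzzy-weighted average of the centers' j-th coordinate."""
    num = sum((u ** m) * c[j] for u, c in zip(u_col, centers))
    return num / sum(u ** m for u in u_col)

def fzdbi_estimate(u_col, V, u_dens_col, V_dens, j, m=2.0, alpha=0.5):
    # alpha weighs the distance-based model; (1 - alpha) the density-based one
    return (alpha * fcm_estimate(u_col, V, j, m)
            + (1 - alpha) * fcm_estimate(u_dens_col, V_dens, j, m))

V = [[0.0, 2.0], [0.0, 6.0]]    # distance-based cluster centers
Vd = [[0.0, 4.0], [0.0, 8.0]]   # density-based cluster candidates
print(fzdbi_estimate([0.9, 0.1], V, [0.5, 0.5], Vd, j=1))
```

With α = 0.5, the estimate sits halfway between what each partition would propose on its own, which is the "both models contribute equally" choice made above.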
fzDBI algorithm
Steps:
1) t = 0; c = √n.
2) Initialize U^t randomly with respect to (3.2).
3) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, as in (4.41) and (4.42).
4) Compute U^{t+1} using (3.3).
5) Compute V^{t+1} using (3.4).
6) If (t > T) or (||U^{t+1} − U^t|| < ε) then Stop.
7) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, using w_ij as in (4.46).
8) Create a strong uniform fuzzy partition as in (4.15) and (4.16).
9) Create a density-based fuzzy partition of X, Θ̄, as in (4.48).
10) Estimate X_M using (4.49). t = t + 1. Go to Step 3.
The difference between fzDBI and the standard FCM algorithm is that, in
addition to a distance-based fuzzy partition, fzDBI generates a density-based fuzzy
partition, which is used together with the distance-based one in missing value imputation.

Method testing
fzDBI was tested in the same way as fzPBI, compared with the six popular
imputation methods PDS, OCS and NPS [140], FCMimp [158], CIAO [159], and
FCMGOimp [160]. fzDBI performed better on datasets of different complexity levels
(Figures 4-14 to 4-17).
[Figure 4-14: (fzDBI) Average RMSE of 50 trials using an incomplete ASET2 dataset
with different missing value percentages]
[Figure 4-15: (fzDBI) Average RMSE of 50 trials using an incomplete ASET5 dataset
with different missing value percentages]
[Figure 4-16: (fzDBI) Average RMSE of 50 trials using an incomplete Iris dataset with
different percentages of missing values]
[Figure 4-17: (fzDBI) Average RMSE of 50 trials using an incomplete YeastMIPS
dataset with different percentages of missing values]
4.2.8 Probability-based defuzzification (fzPBD) [156]
To generate classification information from a fuzzy partition, the partition
needs to be defuzzified into crisp classification information for the data objects. A
popular approach to this problem is the maximum membership degree principle
(MMD). This may be inappropriate in some applications, particularly applications
where the data are non-uniform, because FCM membership is computed from the
distance between the data object and the cluster center; using the membership degree
alone can assign marginal objects of a large cluster to an immediately adjacent small
cluster. Recent solutions include that of Chuang et al. [150], who proposed using
spatial information to adjust the membership status of every data point based on the
membership status of its neighbors. In a similar approach, Chiu et al. [149] used
spatial information to determine the distance between the data point and the cluster
center. However, these methods require the definition of neighborhood boundaries.
Genther et al. [151] proposed defining spatial information by the clustering structure,
rather than the neighborhood, to compute the distance from the data point to the
cluster center.

A common limitation of methods that use spatial information in computing
membership degrees is that they must scale between the actual position information of
a data point and its spatial information. In addition, while spatial information is
appropriate for image segmentation, it may not work for generic data cluster analysis
because neighborhood boundaries are difficult to define.

We developed fzPBD, which uses the fuzzy partition and the data themselves to
construct a probabilistic model of the data distributions. The model is then applied to
produce the classification information of the data points in the dataset.
Method design
We applied the statistical model of our fzBLE method [141] to generate a
probabilistic model of the data distribution using the fuzzy partition.

Given a fuzzy partition matrix U, the vector U_k = {u_ki}, i=1,...,n, k=1,...,c, is a
possibility model of the data distribution of v_k on X. We associate probability
distributions {P_k} with the possibility distributions {U_k} of v_k on X, k=1,...,c, using
(4.22) and (4.23). The statistics at v_k, k=1,...,c, are then derived using (4.24), (4.25)
and (4.26).

Based on the Central Limit Theorem, "the distribution of an average tends to be
normal, even when the distribution from which the average is computed is decidedly
non-normal," the data objects can be assumed to follow the normal distributions at
{v_k}, k=1,...,c, defined as in (4.26). Therefore, the data object x_i is assigned to the
class of v_k, where

P(v_k|x_i) = max_{l=1,...,c} { P(v_l|x_i) }.    (4.50)

Because P(v_k|x_i) = P(x_i, v_k)/P(x_i) = P(x_i|v_k) × P(v_k)/P(x_i), an alternative to (4.50) is

P(v_k|x_i) = max_{l=1,...,c} { P(x_i|v_l) × P(v_l) },    (4.51)

where P(v_l), l=1,...,c, the prior probability of v_l, can be computed using (4.25).
fzPBD algorithm
Steps
1) Convert the possibility distributions in U into probability distributions using
(4.22) and (4.23).
2) Construct a probabilistic model of the data distributions using (4.24) and
(4.26).
3) Apply the model to produce the classification information for every data
point using (4.51).
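The assignment rule of Eqs. (4.50)-(4.51) can be sketched in one dimension. This is an illustration only: the priors, means, and standard deviations below are invented, whereas fzPBD estimates them from the fuzzy partition via Eqs. (4.24)-(4.26).

```python
# Sketch of the fzPBD assignment rule (Eqs. 4.50-4.51): assign each point to
# the cluster maximizing P(x|v_k) * P(v_k), with P(x|v_k) a normal density.
# Cluster parameters here are hypothetical, not estimated as in the thesis.
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def assign(x, params):
    """params: list of (prior, mean, std) per cluster; returns the argmax index."""
    posts = [p * normal_pdf(x, mu, s) for p, mu, s in params]
    return max(range(len(posts)), key=posts.__getitem__)

# A large diffuse cluster next to a small tight one: the point 2.6 is much
# closer to the small cluster's center (3.0), but the posterior keeps it in
# the large cluster, unlike a purely distance-based MMD assignment.
params = [(0.9, 0.0, 2.0), (0.1, 3.0, 0.3)]
print(assign(2.6, params))  # stays with the large cluster -> 0
print(assign(3.0, params))  # dead center of the small cluster -> 1
```

This is exactly the failure mode of MMD described above: a marginal object of a large cluster need not belong to the adjacent small cluster once priors and spreads are taken into account.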
Method testing
We compared fzPBD with the MMD, FCB, NBM and sFCM methods on
ASET2, ASET3, ASET4 and ASET5, using the compactness measure (3.43) and the
misclassification measure (3.21) to evaluate algorithm performance. For each
dataset, the following values of the fuzzifier factor m were used: 1.25, 1.375, 1.50,
1.625, 1.75, 1.875 and 2.0. The number of clusters, c, was set to the known number of
clusters. The FCM algorithm was run 3 times and the best fuzzy cluster partition was
selected to test all the algorithms. We repeated the experiment 100 times and averaged
the performance of each algorithm using the two measures.
[Figure 4-18: Algorithm performance on ASET2]
[Figure 4-19: Algorithm performance on ASET3]
ASET2 and ASET3 each contain five well-separated clusters of the same size, in
two-dimensional and three-dimensional data space, respectively. The performance of
all algorithms is shown in Figures 4-18 and 4-19. fzPBD, MMD and sFCM produced
no misclassifications across multiple levels of m.
[Figure 4-20: Algorithm performance on ASET4]
ASET4 contains 11 well-separated clusters in a five-dimensional data space.
Results are shown in Figure 4-20. fzPBD and MMD performed best; Table 4.22 shows
that they also had the smallest compactness measures across multiple levels of m.
Table 4.22: ASET4 compactness measure

         Fuzzifier factor m
Algo.    1.25   1.375  1.5   1.625  1.75  1.875  2.00
fzPBD    0.39   0.39   0.39  0.39   0.39  0.39   0.39
MMD      0.39   0.39   0.39  0.39   0.39  0.39   0.39
FCB      4.52   4.52   4.52  4.52   4.53  4.54   4.54
NBM      12.85  11.28  8.73  8.72   8.72  8.72   8.72
sFCM     4.94   3.2    0.45  0.49   0.49  0.5    0.5
ASET5 is non-uniform, with three clusters in a two-dimensional data space
(Figure 3-16). Results are shown in Figure 4-21. fzPBD outperformed all the other
algorithms.
[Figure 4-21: Algorithm performance on ASET5]
[Figure 4-22: Algorithm performance on IRIS]
The IRIS dataset contains three clusters corresponding to the three classes of Iris flowers. Results are shown in Figure 4-22. NBM outperformed the other algorithms, and fzPBD performed slightly less well; however, Table 4.23 shows that, compared with NBM, fzPBD had smaller compactness measures across multiple levels of m.
Table 4.23: IRIS compactness measure

Algo.    Fuzzifier factor m
         1.25    1.375   1.5     1.625   1.75    1.875   2.00
fzPBD    0.44    0.44    0.44    0.44    0.44    0.44    0.44
MMD      0.44    0.44    0.44    0.44    0.44    0.44    0.44
FCB      11.84   11.81   11.78   11.76   11.75   11.75   11.75
NBM      1.75    1.75    1.75    1.74    1.74    1.74    1.74
sFCM     0.44    0.44    0.44    0.44    0.44    0.44    0.44
[Line plot: number of misclassified data points vs. fuzzifier m (1.25-2) for fzPBD, MMD, FCB, NBM and sFCM.]
Figure 4-23: Algorithm performance on WINE dataset
The Wine dataset contains information on 13 attributes of the three classes of wines. Results are shown in Figure 4-23. fzPBD and FCB performed better than the other algorithms, and fzPBD performed better than FCB at fuzzifier factor levels less than 1.8. Table 4.24 also shows that fzPBD performed better than FCB across multiple levels of m.
Table 4.24: WINE compactness measure

Algo.    Fuzzifier factor m
         1.25    1.375   1.5     1.625   1.75    1.875   2.00
fzPBD    0.08    0.08    0.08    0.08    0.08    0.08    0.08
MMD      0.05    0.05    0.05    0.05    0.05    0.05    0.06
FCB      0.19    0.18    0.18    0.17    0.16    0.16    0.17
NBM      0.17    0.17    0.17    0.18    0.18    0.18    0.18
sFCM     0.05    0.05    0.05    0.05    0.05    0.05    0.05
[Line plot: number of misclassified data points vs. fuzzifier m (1.25-2) for fzPBD, MMD, FCB, NBM and sFCM.]
Figure 4-24: Algorithm performance on GLASS dataset
The Glass dataset contains information on nine attributes of six classes of glass used in building construction. Results are shown in Figure 4-24. fzPBD significantly outperformed all the other algorithms. Overall, fzPBD outperformed the other methods on both artificial and real datasets, particularly on datasets with clusters that differ in size. fzPBD is, therefore, appropriate for real-world datasets, where the data densities are usually not uniformly distributed.
4.2.9 Fuzzy genetic subtractive clustering method (fzGASCE) [157]
We have addressed the key problems of the fuzzy C-Means (FCM) algorithm: initialization of the fuzzy partition matrix, evaluation of fuzzy clustering results, missing-value imputation and fuzzy partition defuzzification. The experimental results show that our methods outperformed the existing algorithms addressing the same issues. Yet another issue with the FCM algorithm is that, like most other partitioning approaches, it cannot by itself determine the number of clusters, and, as in previous studies, its results depend strongly on the initial parameters. For some initial values, FCM will converge rapidly to a global optimum, but, for others, it may become stuck in a local optimum. A common approach to addressing these limitations is to integrate FCM with the Genetic Algorithm (GA), where the GA manages a set of solution candidates. The FCM algorithm is then applied, and a cluster validity index is used as the GA fitness function to search for the best solution. Ghosh et al. [163] and Liu et al. [166] proposed using the partition coefficient (PC) [92] and Xie-Beni (XB) [94] validity indices. In addition, Liu et al. [166] proposed a modified version of the PC index (MPC) to reduce the monotonic tendency of the index. The Fukuyama-Sugeno cluster index (FS) [93], which measures the compactness and separation of the cluster partition, was used by Ghosh et al. [163]. Lianjiang et al. [164], in a novel self-Adaptive Genetic Fuzzy C-Means algorithm (AGFCM), proposed a validity index combining the PC index with the
total variation of the fuzzy partition. Haider, Pramanik and Kar [162] proposed a GA fitness function (HPK) based on the compactness measure and combined it with an intra-inter validity index for a novel algorithm that automatically determines the number of clusters in the dataset. Lin et al. [165] proposed combining GA and FCM in a novel method with an adaptive cluster validity index (ACVI) based on the intra and inter measures of the fuzzy partition, where the GA is used to solve the trade-off problem between these two factors for a better evaluation of the fuzzy partition.
A common limitation of existing methods combining GA with FCM is that their GA fitness functions are based on cluster validity indices, which usually have a problem with scaling between the compactness and separation factors. In addition, they use the maximum membership degree for defuzzification, which may be improper, as described in Section 4.2.8. In this section, we combine FCM with the GA and fuzzy SC algorithms in a novel clustering algorithm, fzGASCE, that automatically determines the number of clusters in the dataset. The FCM algorithm rapidly determines the exact clustering prototypes of each solution candidate so that the GA, managing a set of such candidates, can select the optimal one. The fuzzy SC method helps the GA escape any local optima [157].
The fuzzy subtractive clustering (fzSC) method has an advantage over the traditional one in that it is more efficient from the computational viewpoint. In addition, it is more appropriate for real-world problems, because it does not require additional information regarding the mountain peak, the mountain radii, or the threshold between the current density and the original density for a data object to be selected as a cluster
candidate. However, fzSC uses a random fuzzy partition and may perform worse in cases where the partition is not the result of a convergent fuzzy partitioning process. For artificial datasets, the FCM algorithm can converge very rapidly, but for some real datasets, because of overlap regions where a data point can belong to two or more clusters, the FCM algorithm may not converge to a local optimum. The fuzzy strong-uniform criteria may not hold for the new fuzzy partition created from the fuzzy partition by the FCM algorithm. Hence, the cluster candidates determined by our fzSC may not be optimal. In addition, a locally optimal fuzzy partition may cause the fzBLE method to improperly pick the best partition from a given set of fuzzy partitions.
Method design
Chromosome
We use a chromosome to represent a whole clustering solution. Each chromosome contains a set of loci, each standing for the index of a data point selected as a cluster center. We set the length of the chromosomes to √n, which is assumed to be the maximum number of clusters in the dataset.
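As a concrete illustration, this chromosome encoding can be sketched in a few lines of Python. This is a minimal sketch, not the thesis implementation; the function name `random_chromosome` is ours.

```python
import math
import random

def random_chromosome(n_points, rng=random):
    """One clustering-solution candidate: a list of distinct data-point
    indices, each locus naming the data point used as a cluster center.
    The chromosome length is sqrt(n), the assumed maximum number of
    clusters in the dataset."""
    length = max(2, int(math.sqrt(n_points)))
    return rng.sample(range(n_points), length)

chrom = random_chromosome(100)   # 10 distinct indices in [0, 100)
```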
Crossover operator
The crossover operator is used to produce two new offspring from a given pair of parents. The roulette-wheel and tournament selection methods are used interchangeably to select parents, maintaining potentially useful solutions in the current generation. A two-point crossover operator with probability Pc = 0.5 is used.
Mutation operator
The mutation operator is used to make changes in portions of the chromosomes of newly created members. Because each chromosome encodes the data points representing the cluster centers of a clustering solution, changing the data points in the chromosome may improve the clustering quality. We, therefore, propose three different tasks for the mutation operator: (i) add a data point as a new cluster center, because this may help to locate a new cluster in a higher-density region, (ii) remove a data point to prevent the inclusion of a sparse cluster, and (iii) replace one data point with another so that the new cluster is located in a higher-density region. These tasks are commonly used by existing methods. However, they are employed in a random or heuristic way and cannot guarantee that the GA will escape local optima. To address this issue, we applied our fzSC method. The fuzzy partitions of the parents are used to estimate the density at every data point using (4.13), (4.14), (4.15), (4.16) and (4.17). At each time, t, the data point x* with the highest density, M*, is selected, and the densities at the remaining data points are updated as in (4.19). The points where the densities change by less than a predefined ratio, R_M, are considered significantly dense and are used in the mutation operator of fzGASCE, instead of the randomly selected data points used in existing methods. We chose a value of 0.95 for R_M. The selection of the value of R_M does not affect the outcome; however, a low value of R_M may slow the convergence of fzGASCE.
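The three mutation tasks can be outlined as follows. This is a hedged sketch, not the thesis code; `dense_points` stands for the significantly dense points returned by the fzSC step.

```python
import random

def mutate(chromosome, dense_points, p_m=0.01, rng=random):
    """Apply one of three mutation tasks to a copy of `chromosome`
    (a list of data-point indices serving as cluster centers):
    (i) add a center, (ii) remove a center, (iii) replace a center.
    Candidates come from the significantly dense points rather than
    from randomly selected data points."""
    chrom = list(chromosome)
    if rng.random() >= p_m:
        return chrom                                  # no mutation this time
    fresh = [p for p in dense_points if p not in chrom]
    task = rng.choice(("add", "remove", "replace"))
    if task == "add" and fresh:
        chrom.append(rng.choice(fresh))               # may capture a new dense cluster
    elif task == "remove" and len(chrom) > 2:
        chrom.pop(rng.randrange(len(chrom)))          # drop a sparse cluster
    elif task == "replace" and fresh:
        chrom[rng.randrange(len(chrom))] = rng.choice(fresh)
    return chrom
```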
Fitness function
Instead of using a cluster validity index for the fitness function, we use our cluster validity method, fzBLE, for cluster evaluation. For each chromosome, FCM is applied to generate the fuzzy partition, which is then used to generate a probabilistic model of the data distributions as in (4.25), (4.26) and (4.27). The fitness function, given a fuzzy partition Θ = {U,V}, is defined as:

fitness(Θ) = log[L(Θ|X)] − log(c), (4.52)

where L(Θ|X), the likelihood of the clustering model given the data, is measured as:

L(Θ|X) = L(U,V|X) = ∏_{i=1}^{n} p(x_i|U,V) = ∏_{i=1}^{n} Σ_{k=1}^{c} p(v_k) × p(x_i|v_k). (4.53)
fzGASCE algorithm
Input: data to cluster, X = {x_i}, i = 1,...,n.
Output: an optimal fuzzy clustering solution:
c: the optimal number of clusters.
V = {v_i}, i = 1,...,c: the cluster centers.
U = {u_ki}, i = 1,...,n, k = 1,...,c: the partition matrix.
Steps
1) Randomly generate a set of chromosomes describing clustering-solution candidates.
2) Compute the fitness value for every chromosome.
3) If the stop criteria are met, then go to Step 7.
4) Apply the crossover operator with probability Pc = 0.5, using the roulette-wheel and tournament parent selection methods.
5) Apply the mutation operator with probability Pm = 0.01. The significantly dense data points are used in the replacement, and fresh offspring are created using these points with a probability of Pc × Pm.
6) Go to Step 2.
7) Select the 'best' chromosome from the population as the clustering solution.
8) Apply [157] for the defuzzification of the fuzzy partition of the solution.
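The steps above can be condensed into a compact GA loop. This is a sketch under our own simplifications (elitist half-and-half replacement instead of the full roulette-wheel/tournament scheme); `fitness`, `crossover` and `mutate` are supplied as plug-in functions.

```python
import random

def fz_gasce_loop(n_points, fitness, crossover, mutate,
                  pop_size=24, max_gens=100, p_c=0.5, rng=random):
    """Skeleton of Steps 1-7: maintain a population of chromosomes
    (lists of data-point indices), breed with crossover and mutation,
    and return the chromosome with the best fitness."""
    k = max(2, int(n_points ** 0.5))
    pop = [rng.sample(range(n_points), k) for _ in range(pop_size)]  # Step 1
    for _ in range(max_gens):                                        # Steps 2-6
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b) if rng.random() < p_c else list(a)
            children.append(mutate(child))
        pop = parents + children
    return max(pop, key=fitness)                                     # Step 7
```

A toy run with stand-in operators (e.g. `crossover=lambda a, b: a[:1] + b[1:]` and an identity `mutate`) returns the index list that scores highest under `fitness`.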
Method testing
To evaluate the performance of fzGASCE, we used four artificial datasets. ASET2, ASET3 and ASET4 each have well-separated clusters of similar sizes; the numbers of clusters and data dimensions of these datasets are (5,2), (5,3) and (11,5), respectively. ASET5 is more complex, containing three clusters that differ in size and density (Figure 3-16). For real datasets, we used the Iris and Wine datasets.
Performance measures
We used three measures, COR, EVAR and MISC, to evaluate algorithm performance. COR is the correctness ratio, defined as

COR = (1/N) Σ_{t=1}^{N} I(c − ĉ_t), (4.54)

where N is the number of trials, c and ĉ are the number of clusters and the predicted number of clusters, respectively, and I(.) is defined as:

I(x) = 1 if x = 0; 0 if x ≠ 0.
EVAR is a measure of the accuracy of the predicted number of clusters, defined as in (4.55):

EVAR = (1/N) Σ_{t=1}^{N} (c − ĉ_t)². (4.55)

MISC is a measure of the overall performance, determined by the number of data objects that were misclassified, defined as in (3.21). To be fair, MISC is calculated only when an algorithm correctly identifies the number of clusters. The assigned cluster label of each object is then compared with its actual cluster label; any mismatch counts as a misclassification.
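The three measures are simple to compute; a minimal sketch (our function names, not the thesis code):

```python
def cor(true_c, predicted_counts):
    """COR (4.54): fraction of trials whose predicted number of
    clusters equals the true number c."""
    return sum(1 for p in predicted_counts if p == true_c) / len(predicted_counts)

def evar(true_c, predicted_counts):
    """EVAR (4.55): mean squared error of the predicted cluster count."""
    return sum((true_c - p) ** 2 for p in predicted_counts) / len(predicted_counts)

def misc(true_labels, assigned_labels):
    """MISC: ratio of misclassified objects; only meaningful for trials
    where the number of clusters was identified correctly."""
    wrong = sum(1 for t, a in zip(true_labels, assigned_labels) if t != a)
    return wrong / len(true_labels)

print(cor(5, [5, 5, 4, 5]))              # 0.75
print(evar(5, [5, 5, 4, 5]))             # 0.25
print(misc([0, 0, 1, 1], [0, 0, 1, 0]))  # 0.25
```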
We compared the performance of fzGASCE with those of eight genetic-algorithm methods that also use FCM, specifically PBMF, MPC, HPK, AGFCM, XB, FS, PC and ACVI [162-166]. An earlier version of fzGASCE, fzGAE, which did not include the fuzzy SC and defuzzification methods, was also used. For each dataset, the number of clusters, c, was set to the known number of clusters. All algorithms were run with a population size of 24 and a maximum of 100 generations. The fuzzy partition of each chromosome was generated using the FCM algorithm with 10 iterations, and the fuzzifier factor, m, was set to 1.25. We repeated the experiment 100 times and averaged the performance of each algorithm using the values of COR, EVAR and MISC.
ASET2 contains five clusters in a two-dimensional data space. The clusters are well separated and of the same size. Performance of all algorithms is shown in Table 4.25. All algorithms had very low MISC measures, indicating that they grouped the data
points into the correct clusters. However, fzGASCE outperformed all the other algorithms on all three measures, and fzGAE performed better than every method except fzGASCE. This comparison illustrates the advantage of using our fzBLE method [147] in the GA fitness function. The use of fuzzy SC [146] in fzGASCE also improved its performance, particularly in escaping local optima.
Table 4.25: Performance of GA algorithms on ASET2

Algorithm    COR      EVAR     MISC
fzGASCE      1.000    0.000    0.000
fzGAE        0.640    0.500    0.000
PBMF         0.510    0.590    0.000
MPC          0.290    0.970    0.000
HPK          0.100    5.010    0.021
AGFCM        0.600    2.800    0.000
XB           0.490    1.450    0.000
FS           0.120    1.100    0.070
PC           0.230    1.040    0.000
ACVI         0.200    2.490    0.011
ASET3 contains five well-separated clusters in a three-dimensional data space, while ASET4 contains 11 clusters in a five-dimensional data space. Performance of all algorithms on ASET3 and ASET4 is shown in Tables 4.26 and 4.27, respectively. On both datasets, fzGASCE outperformed all the other algorithms, while fzGAE was the second best.
Table 4.26: Performance of GA algorithms on ASET3

Algorithm    COR      EVAR     MISC
fzGASCE      1.000    0.000    0.000
fzGAE        0.710    0.380    0.000
PBMF         0.600    0.450    0.000
MPC          0.610    0.860    0.000
HPK          0.120    5.240    0.000
AGFCM        0.650    1.490    0.000
XB           0.640    0.430    0.000
FS           0.520    0.840    0.011
PC           0.620    0.890    0.000
ACVI         0.100    2.100    0.000
Table 4.27: Performance of GA algorithms on ASET4

Algorithm    COR      EVAR     MISC
fzGASCE      1.000    0.000    0.000
fzGAE        0.450    0.750    0.000
PBMF         0.340    1.000    0.000
MPC          0.420    0.820    0.002
HPK          0.010    1.910    0.037
AGFCM        0.340    2.380    0.000
XB           0.410    0.900    0.000
FS           0.400    0.880    0.000
PC           0.450    0.700    0.000
ACVI         0.170    4.650    0.000
The ASET5 dataset is non-uniform, with three clusters in a two-dimensional data space. Table 4.28 shows the algorithm performance. HPK and AGFCM both failed to determine the number of clusters. fzGASCE not only outperformed the other algorithms, with COR = 1.0, but also was the only algorithm that grouped all of the data points into the correct clusters. Although fzGAE performed better than the remaining algorithms, it failed to correctly group data points into clusters, similar to the other methods that used the maximum membership degree for defuzzification.
Table 4.28: Performance of GA algorithms on ASET5

Algorithm    COR      EVAR     MISC
fzGASCE      1.000    0.000    0.000
fzGAE        0.900    0.100    0.107
PBMF         0.700    0.300    0.107
MPC          0.050    0.960    0.107
HPK          0.000    5.770    -
AGFCM        0.000    8.470    -
XB           0.040    0.960    0.107
FS           0.020    3.480    0.107
PC           0.050    0.960    0.107
ACVI         0.080    0.920    0.107
The IRIS dataset contains three clusters corresponding to the three classes of Iris flowers. The performance of the algorithms on this dataset is shown in Table 4.29. HPK and AGFCM again completely failed to detect the number of clusters. fzGASCE outperformed the other algorithms both in detecting the number of clusters and in grouping data points into their correct clusters.
Table 4.29: Performance of GA algorithms on IRIS

Algorithm    COR      EVAR     MISC
fzGASCE      1.000    0.000    0.033
fzGAE        0.880    0.120    0.040
PBMF         0.860    0.140    0.040
MPC          0.040    0.970    0.160
HPK          0.000    5.720    -
AGFCM        0.000    8.120    -
XB           0.050    1.010    0.040
FS           0.390    0.780    0.154
PC           0.080    0.920    0.115
ACVI         0.150    0.850    0.040
Table 4.30: Performance of GA algorithms on WINE

Algorithm    COR      EVAR     MISC
fzGASCE      1.000    0.000    0.213
fzGAE        0.860    0.140    0.303
PBMF         0.000    2.050    -
MPC          0.000    2.810    -
HPK          0.000    6.760    -
AGFCM        0.000    9.210    -
XB           0.270    1.010    0.303
FS           0.000    5.720    -
PC           0.110    0.920    0.303
ACVI         0.090    0.910    0.303
The Wine dataset contains information on 13 attributes of the three classes of wines. Results on this dataset are shown in Table 4.30. fzGASCE and fzGAE reliably identified the correct number of clusters, with COR values of 1 and 0.86, respectively. Among the other algorithms, only XB and ACVI detected the correct number of clusters, and only with low COR values. Overall, fzGASCE outperformed the other methods on both artificial and real datasets. It performed particularly well on datasets with clusters that differ in size, not only in predicting the correct number of clusters, but also in grouping the data points into the correct clusters. fzGASCE is, therefore, appropriate for real-world problems, where the data densities are not uniformly distributed.
4.3 Motif finding problem
4.3.1 HIGEDA algorithm
We present a new algorithm, HIGEDA [145], applicable to either DNA or protein sequences, which uses the TCM model and combines the hierarchical gene-set genetic algorithm, HGA (Hong, 2008), with the EM and Dynamic Programming (DP) algorithms to find motifs with gaps. The HGA algorithm helps HIGEDA manage motif seeds and escape from local optima; the EM algorithm uses the best alignment of a motif model on the dataset, where the alignments are generated by DP to estimate the motif model parameters, so that the model fits the best conserved forms of the motifs of interest.
Method design
Let A = {a_1, a_2,..., a_l} ∪ {_} be the set of symbols used for sequence encoding plus the gap symbol, with l = 20 for protein sequences and l = 4 for DNA sequences, and let S = {S_1, S_2,..., S_n} be a set of biological sequences based on A. Let L = {L_1, L_2,..., L_n}, where L_i is the length of sequence i, i = 1,...,n. Assume that the length of the shared motifs is a known value, W.
With m_i = L_i − W + 1, the number of possible positions of a motif in sequence i, denote Z = {z_ij}, i = 1,...,n, j = 1,...,m_i, as the probabilities of motif occurrence at position j in sequence i.
Let Θ be the motif model. The motif finding problem is to determine Θ and Z such that:

P(S,Z|Θ) → max. (4.56)
Motif model
The motif model Θ has two components, Θ_M and Θ_B, which model the motif and non-motif (background) positions in the sequences. A motif is modeled by a sequence of discrete random variables whose values give the probabilities of each of the different letters occurring in each of the different positions of a motif occurrence. The background positions in the sequences are modeled by a single discrete random variable. The motif model is as follows:
Θ = {Θ_B, Θ_M} = [θ_{a,k}], a ∈ A, k = 0, 1,..., W, (4.57)

where column k = 0 (Θ_B) holds the background probabilities, columns k = 1,...,W (Θ_M) hold the per-position motif probabilities, and θ_{a,k} is the probability that symbol a occurs at a background position (k = 0) or at position k of a motif occurrence (1 ≤ k ≤ W), with

Σ_{a∈A} θ_{a,k} = 1, ∀k = 1,...,W. (4.58)
The matrix in (4.57) is a position weight matrix (PWM), but its use differs from that of conventional PWMs in that it contains 21 rows for protein motifs and 5 rows for DNA motifs. The first row stands for the gap symbol, which may occur in a motif, but not in the first or last positions. The remaining rows stand for the residue symbols: 20 amino acid letters or 4 nucleotide letters. W is chosen large enough to accommodate all possible motif consensus sequences. Figure 3-7 shows a motif model representing 'AC', 'AGC' and 'ATC'.
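The shape of this PWM, and the constraint (4.58), can be sketched as follows; the probability values are illustrative only (they are not the thesis's Figure 3-7).

```python
ALPHABET = ["_", "A", "C", "G", "T"]   # gap row first: 5 rows for a DNA motif

def check_pwm(columns):
    """Validate a PWM given as a list of per-column distributions:
    columns[0] is the background, columns[1..W] the motif positions.
    Each motif column must sum to 1 (Equation 4.58), and the gap
    symbol may not occur in the first or last motif position."""
    for k, col in enumerate(columns):
        assert set(col) == set(ALPHABET)
        if k >= 1:
            assert abs(sum(col.values()) - 1.0) < 1e-9
    assert columns[1]["_"] == 0.0 and columns[-1]["_"] == 0.0
    return columns

# Toy three-position model allowing a gap at the middle position:
pwm = check_pwm([
    {"_": 0.00, "A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # background
    {"_": 0.00, "A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"_": 0.40, "A": 0.10, "C": 0.20, "G": 0.20, "T": 0.10},
    {"_": 0.00, "A": 0.05, "C": 0.80, "G": 0.10, "T": 0.05},
])
```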
Sequence and motif model alignment with gaps
Given an input subsequence s, the best alignment with gaps of s and Θ is created using the DP algorithm. Because s comes from a real dataset, gaps are allowed only in Θ. The first symbol of s is aligned with the first column in Θ. Consecutive symbols from s are aligned with either gap or symbol columns to achieve the best alignment score. A conventional dynamic alignment score is the sum of the pairwise alignment scores. Here, instead, it is the product of all pairwise alignment scores, and hence it is the probability that the best consensus from Θ matches s. To control the occurrence of gaps, we define PPM, POG and PEG as the reward for a perfect match and the penalties for opening and extending a gap, respectively. We chose, by empirical experiments, PPM = 1, POG = 0.00875, and PEG = 0.325. Let U_jk be the best alignment score up to the
j-th symbol s_j in s and column k of Θ. U_jk is calculated as in (4.59):

U_jk = U_{j−1,k−1} × PPM × θ_{s_j,k}, if k = j;
U_jk = U_{j,k−1} × PG × (1 − θ_{s_{j+1},k+1}), if j = 0 and k > j;
U_jk = max{U_{j−1,k−1} × PPM × θ_{s_j,k}, U_{j,k−1} × PG × (1 − θ_{s_{j+1},k+1})}, otherwise, (4.59)
where PG is either PEG or POG, depending on the gap status at column k in U. Let s = 'ACG' and Θ be as shown in Figure 4-25. Figure 4-26 shows the dynamic alignment matrix {U_jk} of s w.r.t. Θ. The best alignment score is 0.0039, found in the second row and last column of the matrix. Hence, the best consensus for s = 'ACG' by Θ is 'AC'. 'AC' is considered in-motif and the last symbol, 'G', out-motif for subsequence s. The motif model will be adjusted to fit 'AC' instead of 'ACG'. This feature produces occurrence probabilities for gap symbols in our motif model even if no such symbol appears in the sequence set.
     B       1       2       3
_    0.00    0.00    0.25    0.00
A    0.02    0.50    0.10    0.04
G    0.02    0.30    0.35    0.02
C    0.03    0.10    0.01    0.90
T    0.03    0.10    0.29    0.04
Figure 4-25: A motif model
     1       2        3
A    0.50    0.0043   -
C    -       0.0050   0.0039
G    -       -        0.0001
Figure 4-26: Dynamic alignment of s = 'ACG' w.r.t. Θ from Figure 4-25
Motif occurrence model
Let S_ij be the subsequence of length W at position j in sequence i. Denote I(.) as the indicator function, and P_M(S_ij) and P_B(S_ij) as the conditional probabilities that S_ij is generated by the motif model and the background model, respectively:

P_M(S_ij) = P(S_ij|Θ_M) = ∏_{k=1}^{W} ∏_{a∈A} (θ_{a,k})^{I(S_{i,j+k−1} = a)}, (4.60)

P_B(S_ij) = P(S_ij|Θ_B) = ∏_{k=1}^{W} ∏_{a∈A} (θ_{a,0})^{I(S_{i,j+k−1} = a)}. (4.61)
Let λ be the prior probability of motif occurrence at every possible position in the sequence set. Similarly to the TCM model in MEME [70], the motif occurrence probability at position j in sequence i is

z_ij = λ × P_M(S_ij) / [λ × P_M(S_ij) + (1 − λ) × P_B(S_ij)]. (4.62)

The motif log-odds score of S_ij is defined by:

los(S_ij) = log(P_M(S_ij)/P_B(S_ij)). (4.63)

S_ij is considered a motif hit if it satisfies the inequality:

los(S_ij) > log[(1 − λ)/λ]. (4.64)
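Equations (4.60)-(4.64) chain together directly; a minimal sketch with a hypothetical two-column motif model (the probabilities are ours, not from the thesis):

```python
import math

# Toy model: PWM[k][a] = theta_{a,k}; column 0 is the background,
# columns 1..W the motif positions (illustrative values).
PWM = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # background
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.80, "G": 0.10, "T": 0.05},
]

def motif_posterior(sub, lam):
    """Return z (4.62), the log-odds score los (4.63), and the
    motif-hit decision (4.64) for subsequence `sub` of length W."""
    pm = pb = 1.0
    for k, sym in enumerate(sub, start=1):
        pm *= PWM[k][sym]   # P_M (4.60)
        pb *= PWM[0][sym]   # P_B (4.61)
    z = lam * pm / (lam * pm + (1 - lam) * pb)       # (4.62)
    los = math.log(pm / pb)                          # (4.63)
    return z, los, los > math.log((1 - lam) / lam)   # (4.64)

z, los, hit = motif_posterior("AC", lam=0.2)
# pm = 0.56, pb = 0.0625; log-odds ~ 2.19 > log(4), so "AC" is a hit
```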
Gapped motif model
When gaps are allowed in the alignment of S_ij and Θ, P_M(S_ij) and P_B(S_ij) are calculated using the recovered version of S_ij with gaps, similar to that in Xie [178]. Let s^m and s^c be the in-motif and out-motif parts, respectively, of the best consensus for S_ij by Θ. Then:

P_M(S_ij) = P_B(s^c) × P_M(s^m) = P(s^c|Θ_B) × P(s^m|Θ_M), (4.65)

P_B(S_ij) = P_B(s^c) × P_B(s^m) = P(s^c|Θ_B) × P(s^m|Θ_B). (4.66)
Multiple-motif finding method
To find multiple motifs in a given dataset, we use a mask variable M, M = {M_ij}, i = 1,...,n, j = 1,...,m_i, where M_ij represents the probability that a motif occurs again at position j in sequence i. M is initially set to {1}. The modified version of (4.62) with respect to M is

z_ij^m = P(z_ij = 1|M) = z_ij × min_k{M_{i,j+k}}. (4.67)

Once a motif is found, its positions are updated in M:

M_{i(j+k)} = M_{i(j+k)} × (1 − ΔM_{i(j+k)}), (4.68)

where ΔM_{i(j+k)} = P(z_ij = 1) × (W − k)/W, k = 0,...,W−1. The update mechanism in (4.68) allows for multiple overlapping motifs.
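The mask update (4.68) can be sketched as follows (our function name; the linear taper (W−k)/W comes from the equation above):

```python
def update_mask(M, i, j, z_ij, W):
    """Down-weight the W positions covered by a motif found at (i, j):
    M[i][j+k] *= 1 - z_ij*(W-k)/W for k = 0..W-1 (Equation 4.68).
    Positions near the end of the site keep more weight, which is what
    allows later motifs to overlap this one."""
    for k in range(W):
        if j + k < len(M[i]):
            M[i][j + k] *= 1.0 - z_ij * (W - k) / W
    return M

M = [[1.0] * 6]                  # mask starts as all ones
update_mask(M, 0, 1, 1.0, 3)     # a certain occurrence at position 1
# M[0] is now [1.0, 0.0, 1/3, 2/3, 1.0, 1.0]
```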
Hierarchical gene-set genetic algorithm (HGA)
The GA is a global optimization procedure that performs adaptive searches to find solutions to large-scale optimization problems with multiple local optima.
Conventional GAs use crossover and mutation operators to escape local optima. These operators depend strongly on how the probabilities of crossover and mutation are chosen. Recent improvements in GAs have focused on adaptively adjusting the operator probabilities so that the genetic process quickly escapes local optima. However, setting up adaptive GAs is difficult, and most approaches are based on heuristics.
Here we use HGA, an improved GA by Hong [75]. HGA treats a chromosome as a set of gene-sets, not a set of genes as in conventional GAs, as a mechanism to escape local optima. Starting with gene-sets of the largest size, equal to half the chromosome length, and ending with gene-sets of size 1, HGA performs crossover and mutation operations based on gene-set boundaries. When the model is (δ, k)-convergent, HGA expects to find a global optimum and attempts to escape any remaining local optima by performing genetic operations with the largest-size gene-sets. HGA is most appropriate for our genetic model, because each gene-set represents a set of adjacent columns in the motif model, which in turn represent patterns of adjacent residues in biological sequences. Genetic operations based on gene-set boundaries allow residues to come together as well as to change simultaneously.
EM algorithm of HIGEDA
HIGEDA solves the motif finding problem using an EM algorithm to maximize the likelihood function (4.69) over the entire dataset:

L(S,Z|Θ) = log P(S,Z|Θ) → max. (4.69)

Estimation step: Z is estimated using (4.62) (or (4.67)):
z_ij^(t+1) = λ_t × P_M(S_ij) / [λ_t × P_M(S_ij) + (1 − λ_t) × P_B(S_ij)]. (4.70)
Maximization step: Θ and λ are computed with respect to (4.70), maximizing

L(S,Z|Θ) = Σ_{i=1}^{n} Σ_{j=1}^{m_i} [(1 − z_ij) log P_B(S_ij) + z_ij log P_M(S_ij) + (1 − z_ij) log(1 − λ) + z_ij log(λ)] → max, (4.71)

such that

∂L/∂λ = 0 ⟹ λ = (1/(n×m)) Σ_{i,j} z_ij. (4.72)

With respect to the constraints (4.58), the objective function is relaxed using Lagrange multipliers γ_k, k = 0,...,W:

L' = L(S,Z|Θ) + Σ_{k=0}^{W} γ_k (Σ_{a∈A} θ_{a,k} − 1).

Since

∂log P_M(S_ij)/∂θ_{a,k} = [1/P_M(S_ij)] × ∂P_M(S_ij)/∂θ_{a,k} = I(S_{i,j+k−1} = a)/θ_{a,k},

we have

∂L'/∂θ_{a,k} = Σ_{i,j} z_ij × I(S_{i,j+k−1} = a)/θ_{a,k} + γ_k = 0.

So,

θ_{a,k}^M = −(1/γ_k) Σ_{i,j} z_ij × I(S_{i,j+k−1} = a). (4.73)
Using (4.58) and (4.73),

Σ_{a∈A} θ_{a,k}^M = −(1/γ_k) Σ_{a∈A} Σ_{i,j} z_ij × I(S_{i,j+k−1} = a) = 1,

so

γ_k = −Σ_{a∈A} Σ_{i,j} z_ij × I(S_{i,j+k−1} = a) = −Σ_{i,j} z_ij. (4.74)

Using (4.73) and (4.74),

θ_{a,k}^M = Σ_{i,j} z_ij × I(S_{i,j+k−1} = a) / Σ_{i,j} z_ij. (4.75)

Similarly,

θ_{a,0}^B = Σ_{i,j} (1 − z_ij) Σ_{k=1}^{W} I(S_{i,j+k−1} = a) / [W × Σ_{i,j} (1 − z_ij)]. (4.76)
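The closed-form update (4.75) is just a z-weighted symbol count per motif column; a small sketch with our names and toy data:

```python
def update_motif_column(seqs, Z, W, k, alphabet="ACGT"):
    """theta_{a,k} (4.75): expected count of symbol a at motif position
    k over all candidate sites, weighted by the occurrence
    probabilities z_ij and normalized by the total weight."""
    num = {a: 0.0 for a in alphabet}
    den = 0.0
    for i, s in enumerate(seqs):
        for j in range(len(s) - W + 1):
            num[s[j + k - 1]] += Z[i][j]   # z-weighted I(S_{i,j+k-1} = a)
            den += Z[i][j]
    return {a: num[a] / den for a in alphabet}

seqs = ["ACGT", "AAGT"]
Z = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # one confident site per sequence
col1 = update_motif_column(seqs, Z, W=2, k=1)  # sites 'AC' and 'AG'
# col1["A"] == 1.0: both weighted sites start with 'A'
```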
Refining model parameters
The maximization step yields the model parameters Θ and λ. Because the values of Z are estimated during the EM process, a straightforward update to Θ and λ using (4.70), (4.75) and (4.76) may cause an oscillation in convergence. MEME estimates λ by trying a series of candidate values up to 1/(W+1); this requires significant computational time and squanders the benefit of the maximization step. Instead, we apply the gradient-descent learning law to update each model parameter ξ:

ξ(t+1) = ξ(t) + μ_t × Δξ, (4.77)

where Δξ = ξ_e(t+1) − ξ(t), and ξ(t), ξ_e(t+1) and ξ(t+1) are the current, estimated and new values of ξ, respectively. The learning rate μ_t is slightly reduced over processing time. A popular form of μ_t is as in (4.78), where T is the processing duration:
μ_t = μ_0 × (1 − t/T). (4.78)
If the size of the dataset is small, some elements of Θ may be zero. These are not appropriate for the Bayesian process and would remain zero. To solve this problem, MEME uses a Dirichlet mixture model [34]. While this has a strong mathematical basis, its drawback lies in selecting the right number of mixture components; in addition, this number has no meaning in sequence analysis. Instead, we use pseudo-count methods. For DNA, we borrow an added-pseudo-counts method from Henikoff and Henikoff [74], with an added portion of 0.01. For proteins, we propose a method using the motif model and the substitution probability matrix from BLOSUM62, and a heuristic that the pseudo-counts should be position-specific and depend on strong signals [145]. The pseudo-count value for a given symbol a in column k of Θ is calculated using Equation (4.79):

psc_{a,k} = Σ_{b=1}^{l} θ_{b,k} × P_{a/b}, (4.79)

where P_{a/b} is the BLOSUM substitution probability for amino acid a given the observation of amino acid b. The motif model is then refined using (4.80), in which α and β are predefined and α + β = 1:

θ'_{a,k} = α × θ_{a,k} + β × psc_{a,k}. (4.80)
HIGEDA Algorithm
HIGEDA uses HGA to manage its population and to escape local optima. During the evolution process, the members of the current generation are processed with the EM algorithm a small number of times and ranked based on their goodness, measured using a fitness function. The best members are used to create new ones using the crossover and mutation operators. Newly created members replace the worst ones in the current generation using tournament selection to form the new generation. This process repeats until the number of generations reaches a predefined number, or the goodness ratio of the current generation converges. The best members from the last generation are taken as possible motif candidates.
Each member contains a variable λ and a chromosome encoding Θ, as described in (4.57); each gene in the chromosome represents a column in Θ that has 21 elements for a protein motif, or five elements for a DNA motif. There are W+1 such genes in each chromosome. We note that while Bi [71] proposed encoding a PWM using chromosomes, he did not discuss it in detail. Because the length W of a motif is small relative to the sequence lengths, our genetic model consumes less memory than those of Wei and Jensen [83] or Bi [71].
The gene-set maximum size is l_0 = W/2, the initial size is l and the final size is 1, such that

l = 2^q, where 2^q ≤ l_0 < 2^(q+1). (4.81)

We use δ = 0.05 and k = 9 for the (δ, k)-convergence criterion.
The crossover operator is used to produce two new children from a given pair of parents. We use a two-point gene-set crossover operator with probability p_c = 0.95.
The mutation operator is used to make changes in some portions of the chromosomes of newly created members. Because each chromosome encodes a Θ resulting from an alignment on the whole sequence set, a one-position shift to the left or right may improve the quality of the alignment. We propose two different tasks for the mutation operator: (i) change within a gene-set: two rows are selected randomly, and the cells corresponding to the selected rows in the given gene-set are exchanged; and (ii) change to the whole chromosome: the gene-sets in the chromosome are shifted left or right by one position, and the blank gene-sets are filled with average probability values.
The fitness function of HIGEDA is a combination of the objective function (4.69) and the posterior scoring function (4.82). The objective function assesses the best-fit model, while the posterior scoring function determines how well the model is distinguished from the background in the entire dataset. We use the motif posterior scoring approach of Nowakowski and Tiuryn [79] and estimate the prior probability of motif occurrence. It follows that the posterior score of subsequence S_ij at position j in sequence i has the form:

S_p(S_ij|Θ,λ) = log[λ × P_M(S_ij) / ((1 − λ) × P_B(S_ij))]. (4.82)

(4.69) and (4.82) are normalized and combined. Both are important to the model evaluation, but not equally in all contexts. The fitness function is first applied to find the best-fit models, and then used with the posterior scoring function to select the most significant model. To this rule, we add a scaling function, v(t), shown in Figure 4-27. The fitness function of HIGEDA is then defined as:

fit(Θ) = (v/v_M) × L(S,Z|Θ) + (v_M − v) × Σ_{i=1}^{n} Σ_{j=1}^{m_i} [P(z_ij = 1)/Σ_{i,j} P(z_ij = 1)] × S_p(S_ij|Θ,λ). (4.83)
Figure 4-27: v(t) = v_M × T / (T + t²), where v_M is the maximum value of v.
To find gapped motifs, in its later phases HIGEDA tries to improve the most significant model by introducing gaps. This is similar to a local alignment problem, but instead of aligning every sequence to the sequence set, HIGEDA uses DP to align each sequence to the motif model that describes the local alignment of the sequence set. While the run time of a gapless alignment is O(W), that of a DP alignment is O(W²). By restricting the use of DP, the speed of HIGEDA is significantly improved.
HIGEDA Algorithm [145]
• Input: sequence set S, motif length W.
• Output: motif model Θ.
Steps
1) Set M = {1} once, set the gene-set parameters using (4.71), randomly
generate the first generation, and set gNo = 0.
2) For each member of the current generation, apply EM once. Measure
member goodness using the fitness function (4.83).
3) Select parents from the best members using ranking selection; apply the
genetic operators to create new members.
4) Create the new generation using tournament selection; gNo++.
5) If the current generation is (δ, k)-convergent, then
6) save λ, set λ = 1.0,
7) apply the genetic operators to refresh the current generation,
8) restore λ, and go to Step 2.
9) If the stop criteria are met, then
10) set λ = λ/2, and set gNo = 0.
11) If λ > 0, then go to Step 2.
12) Take the best member as the motif candidate; run EM with the candidate
until convergence to obtain the motif model.
13) Output the motif model and update M using (4.68).
14) Stop, or go to Step 1 to find more motifs.
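The control flow above can be sketched in Python. This is an illustrative skeleton only, under the assumption that λ starts at 1.0 and is halved at each restart: real HIGEDA members are motif models evolved by EM and genetic operators, whereas here a member is a single number, and em_step and fitness are hypothetical stand-ins.

```python
import random

# Skeleton of the HIGEDA outer loop (Steps 1-14), with EM and the genetic
# operators stubbed out. Illustrative sketch only: a "member" is a float,
# and em_step/fitness are placeholder functions, not the thesis's models.

def em_step(member):          # stand-in for one EM update of a motif model
    return member * 0.9 + 1.0

def fitness(member):          # stand-in for the fitness function (4.83)
    return -abs(member - 10.0)

def higeda_skeleton(pop_size=8, lam0=1.0, max_gens=20, seed=0):
    rng = random.Random(seed)
    lam = lam0
    best = None
    while lam > 1e-3:                          # Step 11: continue while lambda > 0
        pop = [rng.uniform(0, 20) for _ in range(pop_size)]   # Step 1
        for _ in range(max_gens):              # simplified stop criterion (Step 9)
            pop = [em_step(m) for m in pop]    # Step 2: one EM pass per member
            pop.sort(key=fitness, reverse=True)
            parents = pop[: pop_size // 2]     # Step 3: ranking selection
            children = [(a + b) / 2 for a, b in zip(parents, reversed(parents))]
            pop = parents + children           # Step 4: next generation
        cand = max(pop, key=fitness)           # Step 12: best member
        if best is None or fitness(cand) > fitness(best):
            best = cand
        lam /= 2                               # Step 10: halve lambda
    return best
```

The stubbed EM step contracts every member toward its fixed point, so repeated restarts with a shrinking λ converge on the best candidate.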
Method testing
We compared HIGEDA with five open-source algorithms (GAME 2006, MEME
v3.5, GLAM2 2008, BioProspector 2004 and PRATT v2.1) and one web application,
GEMFA (http://gemfa.cmh.edu); for the EM motif algorithm, we used the results from
Bi [71]. All algorithms were run 20 times on each dataset with the run times recorded;
for GEMFA, the run times were obtained from the website. Default run parameters were
used for all programs. For HIGEDA, the number of generations at each gene-set level
was 50 and the population size for all gene-set levels was 60. For the algorithms in
Tables 4.31, 4.32 and 4.33, motif lengths were set to the known motif lengths. For
Tables 4.34 and 4.35, we first ran HIGEDA to find statistically significant motifs, and
then ran the other algorithms using the lengths of the best motifs found by HIGEDA.
Performance measures
As in Bi [71], two quantities are used to evaluate the algorithms: LPC, the letter-level
performance coefficient, and SPC, the motif site-level performance coefficient. If
we denote by delta(·) the indicator function, and by O_i and A_i, respectively, the sets of
known and predicted motif positions in sequence i, then

LPC(S) = (1/n) Σ_{i=1}^{n} |A_i ∩ O_i| / |A_i ∪ O_i|        (4.84)

SPC(S) = (1/n) Σ_{i=1}^{n} delta(A_i ∩ O_i ≠ ∅)        (4.85)
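Under these set-based definitions, both coefficients are straightforward to compute. A minimal sketch, assuming each sequence's known and predicted motif positions are given as Python sets:

```python
# Sketch of the performance coefficients (4.84)-(4.85): known[i] and
# predicted[i] are the sets O_i and A_i of motif positions in sequence i.

def lpc(known, predicted):
    """Letter-level performance coefficient: mean Jaccard overlap per sequence."""
    n = len(known)
    return sum(len(a & o) / len(a | o) if (a | o) else 0.0
               for a, o in zip(predicted, known)) / n

def spc(known, predicted):
    """Site-level performance coefficient: fraction of sequences with any overlap."""
    n = len(known)
    return sum(1 for a, o in zip(predicted, known) if a & o) / n
```

LPC rewards exact positional agreement, while SPC only asks whether a predicted site touches a true site at all.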
Simulated DNA datasets
We generated simulated DNA datasets as in Bi [71], with three different
background base compositions: (a) uniform, where A, T, C and G occur with equal
frequency; (b) AT-rich (AT = 60%); and (c) CG-rich (CG = 60%). The motif string,
GTCACGCCGATATTG, was inserted once (or twice, if needed) into each sequence
after a defined level of change to the string: (i) 9% change, representing limited
divergence (i.e., 91% of symbols are identical to the original string); (ii) 21% change; or
(iii) 30% change (essentially background or random sequence variation).
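A dataset along these lines can be generated with a short script. This is a sketch under our own assumptions (uniform choice of substituted symbols, a single insertion per sequence); it follows the description above rather than the exact procedure of Bi [71].

```python
import random

# Sketch of the simulated-dataset construction: draw a background sequence
# with a given base composition, mutate a fixed fraction of the motif
# string, and overwrite a random window with the mutated copy.

MOTIF = "GTCACGCCGATATTG"

def make_sequence(length, comp, motif=MOTIF, change=0.09, rng=None):
    rng = rng or random.Random()
    bases, weights = zip(*comp.items())
    seq = [rng.choices(bases, weights)[0] for _ in range(length)]
    m = list(motif)
    for i in rng.sample(range(len(m)), round(change * len(m))):
        m[i] = rng.choice([b for b in "ACGT" if b != m[i]])  # substitute symbol
    pos = rng.randrange(length - len(m))
    seq[pos:pos + len(m)] = m                                # insert motif copy
    return "".join(seq)

# AT-rich background (AT = 60%)
at_rich = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}
```

With change = 0.09 and a 15-mer motif, roughly one symbol per inserted copy is substituted, matching the 91%-identity setting.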
Table 4.31: Average performance (LPC/SPC) on simulated DNA datasets

Motif identity 91%:
Algorithm   Uniform     AT-Rich     CG-Rich     Avg. runtime
HIGEDA      1.00/1.00   1.00/1.00   1.00/1.00   32s
GEMFA       0.98/1.00   0.98/1.00   1.00/1.00   22s
GAME        0.86/0.88   0.88/0.90   0.91/0.94   2m26s
MEME        1.00/1.00   1.00/1.00   1.00/1.00   2s
GLAM2       1.00/1.00   1.00/1.00   1.00/1.00   52s
BioPro.     0.99/1.00   0.99/1.00   0.94/0.95   2s
PRATT       0.83/0.95   0.88/1.00   0.46/0.80   0.2s
EM          0.99/1.00   0.99/1.00   1.00/1.00   n/a

Motif identity 79%:
Algorithm   Uniform     AT-Rich     CG-Rich     Avg. runtime
HIGEDA      0.87/0.96   1.00/1.00   1.00/1.00   32s
GEMFA       0.87/0.88   0.87/0.90   0.85/0.89   18s
GAME        0.43/0.55   0.55/0.61   0.64/0.71   2m11s
MEME        0.95/0.95   1.00/1.00   1.00/1.00   2s
GLAM2       1.00/1.00   0.07/0.25   1.00/1.00   57s
BioPro.     0.86/0.92   0.89/0.95   0.94/0.98   2s
PRATT       0.46/0.70   0.03/0.15   0.09/0.15   0.2s
EM          0.83/0.87   0.89/0.91   0.87/0.89   n/a

Motif identity 70%:
Algorithm   Uniform     AT-Rich     CG-Rich     Avg. runtime
HIGEDA      0.79/0.81   0.84/0.89   0.65/0.76   32s
GEMFA       0.56/0.65   0.50/0.60   0.52/0.56   20s
GAME        0.14/0.28   0.19/0.34   0.20/0.34   1m55s
MEME        0.44/0.50   0.75/0.75   0.27/0.30   2s
GLAM2       0.95/0.95   0.01/0.05   0.00/0.05   1m20s
BioPro.     0.26/0.33   0.39/0.44   0.25/0.33   2s
PRATT       0.31/0.40   0.05/0.10   0.19/0.30   0.2s
EM          0.38/0.48   0.47/0.58   0.48/0.54   n/a
In Table 4.31, we compare results with those obtained with seven other
algorithms. When motif sequences are 91% identical, all algorithms performed equally
well regardless of base composition. However, when the identity dropped to 79% or
70%, HIGEDA in general performed as well as or significantly better than the other
algorithms on all base compositions. Hence, HIGEDA is not significantly affected by
noise.
Finding motifs in biological DNA sequences
Table 4.32: Average performance (LPC/SPC/run time) on eight DNA datasets
(# of sequences / motif length / # of motif occurrences)

Algo.      E2F (27/11/25)    ERE (25/13/25)    crp (24/22/18)    arcA (13/15/13)
HIGEDA     0.57/0.96/33s     0.58/0.87/37s     0.75/0.90/22s     0.38/0.53/56s
GEMFA      0.64/0.85/39s     0.74/0.92/43s     0.57/0.88/12s     0.32/0.42/26s
GAME       0.24/0.90/2m45s   0.24/0.75/2m11s   0.45/0.80/2m33s   0.05/0.10/1m26s
MEME       0.71/0.85/16s     0.68/0.68/4s      0.55/0.68/1s      0.47/0.54/4s
GLAM2      0.84/0.93/1m28s   0.74/0.92/1m10s   0.54/0.64/1m13s   0.47/0.54/1m17s
PRATT      0.17/0.33/0.2s    0.24/0.44/0.2s    0.22/0.56/0.1s    0.17/0.23/0.4s
BioPros.   0.50/0.63/3s      0.65/0.71/3s      0.37/0.45/1s      0.01/0.01/6s

Algo.      argR (17/18/17)   purR (20/26/20)   tyrR (17/22/17)   ihf (17/48/24)
HIGEDA     0.75/0.94/1m22s   0.90/0.93/1m38s   0.34/0.41/1m19s   0.14/0.31/3m44s
GEMFA      0.35/0.38/30s     0.81/0.90/39s     0.32/0.49/27s     0.11/0.26/1m
GAME       0.36/0.55/2m31s   0.33/0.53/3m23s   0.11/0.23/2m39s   0.07/0.18/10m7s
MEME       0.75/0.94/5s      0.58/0.85/4s      0.43/0.47/3s      0.00/0.00/5s
GLAM2      0.38/0.47/1m32s   0.17/0.50/3m11s   0.35/0.35/1m55s   0.12/0.33/9m55s
PRATT      0.16/0.29/0.2s    0.46/0.95/0.2s    0.06/0.12/0.3s    0.06/0.25/0.5s
BioPros.   0.62/0.69/8s      0.18/0.20/7s      0.33/0.42/6s      0.05/0.13/17s
We used eight DNA transcription factor binding site datasets: two eukaryotic
datasets, ERE and E2F, and six bacterial datasets, CRP, ArcA, ArgR, PurR, TyrR and
IHF. Table 4.32 shows that HIGEDA performed as well as or better than the other
algorithms on most datasets. One exception is ERE, possibly because it is an example of
OOPS, which is the motif-finding model used by GEMFA. Also, GLAM2 performed
better than HIGEDA on IHF, which contains a motif of length 48.
Finding motifs in biological protein sequences
We selected twelve protein families from PFAM and Prosite representing motifs
with different levels of sequence specificity, from completely defined to more subtle,
degenerate and gapped.
[Sequence logo omitted; only the axis labels survived extraction.]
Figure 4-28: Strep-H-triad motif found by HIGEDA.
Results obtained with HIGEDA are compared to those obtained with the other
algorithms (Table 4.33). The gap parameters of all algorithms, where supported, were set
to the known gap structures. All algorithms identified the unambiguous DUF356
IHPPAH motif. For the Strep-H-triad motif, the HIGEDA, MEME and GLAM2
consensus sequences are more specific than the known motif because of the motif
decoding procedure. Full degeneracy is shown in Figure 4-28.
Table 4.33: Detection of protein motifs (1-8, PFAM; 9-12, Prosite)

1. DUF356 (20,158); known motif: IHPPAH
   HIGEDA (12s):    IHPPAH
   MEME (2s):       IHPPAH
   GLAM2 (1m58s):   IHPPAH
   PRATT (0.1s):    IHPPxH

2. Strep-H-triad (21,486); known motif: HxxHxH
   HIGEDA (40s):    HGDHYH
   MEME (5s):       HGDHYH
   GLAM2 (1m51s):   HGDHYH
   PRATT (0.1s):    HxxHxH

3. Excalibur (26,296); known motif: DxDxDGxxCE
   HIGEDA (1m):     DRD[RNGK]DG[IV][AG]CE
   MEME (4s):       [WY]Q[GA][NW]YYLKSD
   GLAM2 (2m56s):   DRDKDGVACE
   PRATT (0.5s):    DxDxxxC

4. Nup-retrotrp (14,1475); known motif: GRKIxxxxxRRKx
   HIGEDA (3m18s):  SGRKIKTAVRRKK
   MEME (10s):      W[DE]C[DE][TV]C[LC][VL]QNK[AP][ED]
   GLAM2 (6m55s):   SNGKNMFSSSGTS
   PRATT (1s):      FSSS[GP]TxxSx(1,2)RK

5. Flagellin-C (42,1074); known motif: NRFxSxIxxL
   HIGEDA (1m29s):  RA[NDQG]LGA[FV]QNR
   MEME (6s):       R[AS][DNQ]LGA[VF]QNR
   GLAM2 (3m35s):   RADLGAFQNR
   PRATT (0.07s):   AxQ

6. Xin Repeat (25,3785); known motif: GDV[KQR][TSG]x[RKT]WLFETxPLD
   HIGEDA (5m45s):  GDV[RK][ST][ACT][RK]WLFETQPLD
   MEME (1m1s):     KGDV[RK]T[CA][RK]W[LM]FETQPL
   GLAM2 (6m36s):   HKGDVRTCRWLFETQP
   PRATT (6s):      GDVxTxxWxFETxP

7. Planc. Cyt. C (12,1131); known motif: C{CPWHF}{CPWR}CH{CFYW}
   HIGEDA (1m):     [NHKSY]C[AQELMFTV][AGS]CH
   MEME (9s):       FSPDGK
   GLAM2 (3m42s):   RFSPDG
   PRATT (0.02s):   PDxxxL

8. zf-C2H2 (115,99); known motif: Cx(1,5)Cx3#x5#x2Hx(3,6)[HC]
   HIGEDA (2m12s):  C[_Y][_RQEK][_C][_EKP][_EGIK]C[_G]K[ARST]FS[RQK][KS]S[NHS]L[NKT][RKST]H[QILKM]R[ISTV]H
   MEME (45s):      EIC[NG]KGFQRDQNLQLHRRGHNLPW
   GLAM2 (19m17s):  YKC_P_CGK_FS_KSSLT_H_RI_HT
   PRATT (0.1s):    Hx(3,5)H

Table 4.33 (con't.)

9. EGF_2 (16,3235); known motif: CxCx2[GP][FYW]x(4,8)C
   HIGEDA (4m19s):  C[_EK][_C][_EILSV]C[NDE][NDQEPS]G[FWY][AQESTY]G[DS]DCS[GI]
   MEME (15s):      GECxCNxGYxGSDCSI
   GLAM2 (8m):      CNx(0,9)GECxCNEGWSGDDC
   PRATT (1s):      Gx(0,1)CxCx5Gx2C

10. LIG_Cyclin_1 (23,1684); known motif: [RK]xLx(0,1)[FYLIVMP]
    HIGEDA (1m8s):  [RK]R[RL][RIL][DEIFY]
    MEME (18s):     PAPAP
    GLAM2 (2m35s):  Tx(0,1)RKP
    PRATT (0.5s):   Kx(0,1)R

11. LIG_PCNA (13,1616); known motif: x(0,3)x{FHWY}[ILM]{P}{FHILVWYP}[DHFM][FMY]
    HIGEDA (1m32s): [RGIKP][_QK][_RDKS][DPST][IL][SV]FFG
    MEME (9s):      GQKTIMSFFS
    GLAM2 (4m59s):  Qx(0,1)SIDSFF[K_][R_]
    PRATT (0.3s):   Lx2Sx(2,4)K [KMTY]

12. Mod_Tyr_Itam (10,317); known motif: [DE]x(2)Yx(2)[LI]x(6,12)Yx(2)[LI]
    HIGEDA (51s):   [RDQE][_S][_CET][_IL][_GM][_QTY][_QELK][_DGK][EI][_RS][_RN][_RGP][_Q][DGV][GHT]YxM[CIY]Q[NGT]L[ILS]
    MEME (2s):      GKEDDGLYEGLNIDDCATYEDIHM
    GLAM2 (3m43s):  E H SLAQKSMDH _SRQ
    PRATT (0.1s):   Lx2Lx(0,1)L
HIGEDA and GLAM2 identified the exact motif for the moderately degenerate
Excalibur family (Section 2.5.6). Only HIGEDA identified the Nup-retrotrp motif,
erring in only one residue and identifying the key arginine (R) residues. No algorithm
identified the Flagellin-C motif, possibly because of the weak motif signal. While all
algorithms performed well in finding the Xin repeat motif, only HIGEDA identified the
Planctomycete Cytochrome C motif, erring in only one residue. For the gapped motif of
the zf-C2H2 family, only HIGEDA identified the key C, H, F and L residues and the
spacing between them. HIGEDA also provided better results on EGF_2 and
LIG_Cyclin_1. No algorithm identified the LIG_PCNA and Mod_Tyr_Itam motifs.
Together, these results show that HIGEDA can effectively find more subtle motifs,
including those with gaps. Because HIGEDA uses a GA to find motif seeds, its run
times are longer than those of MEME and PRATT, but shorter than those of GLAM2, as
indicated in Table 4.33.
4.3.2 New motif discovery using HIGEDA
To test HIGEDA in the prediction of novel motifs, we selected the following
proteins from PFAM: those containing only a ZZ domain, only a SWIRM domain, a ZZ
plus a Myb domain, and a ZZ plus both a Myb and a SWIRM domain. While ZZ
domains have a consensus sequence in PFAM, SWIRM and Myb proteins are defined
only by experimental demonstration of functional properties and protein sequence
alignments. We ran HIGEDA on these groups, varying the output motif length from 4 to
33. We defined statistically significant motifs as those found in more than 85% of
sequences and having a p-value less than 0.0001 (computed using Touzet and Varre
[82]). We also ran MEME, GLAM2 and PRATT.
In Table 4.34, we first show that HIGEDA, MEME and PRATT successfully
identified patterns of the known ZZ motif, Cx(1,5)C. No motif was discovered in the
SWIRM-containing proteins, but HIGEDA, MEME and GLAM2 each discovered a
motif, WxAxEELLLL, common to the Myb proteins, which we propose as a novel
domain. Uniquely discovered by HIGEDA, we propose a second novel domain,
GNW[AQ]DIADH[IV]G[NGST], of unknown function, which is specific to the
ZZ-SWIRM-Myb protein set.
Table 4.34: Motifs of ZZ, Myb and SWIRM domains found by the four algorithms

ZZ (67,5125)
Algorithm   Found motif                      p-value   % Hit
HIGEDA      D[FY]DLC[AQES]xC[EYV]            5.76e-7   91.67
HIGEDA      C[GHPY]D[FY]DLC[AQES]xC          2.25e-6   93.06
MEME        CLICPDYDLC                                 73.00
GLAM2       HSRDHP[ML][IL][QRK]                        91.00
PRATT       Cx2C                                       92.00

SWIRM (49,624): no motif found.

ZZ-Myb (35,2697)
Algorithm   Found motif                      p-value   % Hit
HIGEDA      [ND]W[ST]A[DE]EELLLL             3.81e-9   97.14
MEME        WTADEELLLLD                                100.0
GLAM2       WTAEEELLLLE                                97.00
PRATT       Ex2Lx(4,5)E                                96.00

ZZ-Myb-SWIRM (43,1049)
Algorithm   Found motif                      p-value   % Hit
HIGEDA      GNW[AQ]DIADH[IV]G[NGST]          3.6e-10   100.0
HIGEDA      WGADEELLLLEG                     2.4e-10   100.0
MEME        WGADEELLLLEG                               100.0
GLAM2       KQLCNTLRILPK                               85.00
PRATT       Wx(1,2)D[ADEQ]Ex(2,3)L[ILV]                100.0
Lastly, we selected the protein families Presenilin-1 and Signal peptide
peptidase, which share the conserved sequence GXGD and are grouped in the same clan
in PFAM, as shown in Table 4.35.

Only HIGEDA successfully discovered motifs containing GLGD in these
families. In addition, two patterns, YDIFWVF, which is also identified by MEME and
GLAM2, and FGT[NDP]VMVTVA[KT], which is also identified by MEME, appear
frequently (in 86% of the 151 protein sequences) in the Signal peptide peptidase family.
We therefore propose these patterns as new motifs of this family.
Table 4.35: Motifs of Presenilin-1 and Signal peptide peptidase

Presenilin-1 (86,622)
Algorithm   Found motif                      p-value   % Hit
HIGEDA      GLGDFIFYS                        1.3e-10   85.00
HIGEDA      KLGLGDFIFY                       1.6e-10   85.00
MEME        HWKGPLRLQQ                                 66.00
GLAM2       GDFIFYSLVL                                 85.00
PRATT       Lx(1,3)Lx(2,3)I                            100.0

Signal peptide peptidase (151,690)
Algorithm   Found motif                      p-value   % Hit
HIGEDA      F[AS]MLGLGDIVIPG                 8.9e-08   88.74
HIGEDA      YDIFWVF[GF]T[NDP]                1.7e-08   86.75
HIGEDA      GLF[IFV]YDIFWVF                  8.7e-08   86.75
HIGEDA      FGT[NDP]VMVTVA[KT]               8.6e-08   86.09
MEME        FWVFGT[ND]VMV                              86.00
MEME        YDIFWVFGT[NDP]VMV                          87.00
MEME        [AS]MLGLGDIVIPGI                           92.00
GLAM2       [FL][FI]YD[IV]F[WF]VF[GF]                  95.00
GLAM2       GL FFYDIFWVFGT                             90.00
PRATT       Lx3F                                       100.0
HIGEDA integrates an HGA, EM and dynamic programming (DP) to find motifs in
biological sequences. HIGEDA uses the HGA to manage different motif seeds and to
escape local optima more efficiently than conventional GAs, by using gene-set levels.
By applying DP and proposing a new technique for aligning sequences to a motif
model, and then running EM with the dynamic alignments, HIGEDA creates a new
way to automatically insert gaps into motif models. Using the gradient-descent learning
law, HIGEDA effectively estimates its model parameters without the greedy searches
used in MEME. Using a pseudo-count method based on a simple mechanism and the
BLOSUM62 substitution probability matrix, HIGEDA avoids the small-dataset problem
of zeros in motif model elements, without using the Dirichlet mixture priors used in
MEME and similar approaches. Because HIGEDA uses a set of randomly generated
motif seeds, it outperforms MEME, GLAM2 and several other algorithms in finding
subtle, more degenerate and gapped motifs. Lastly, we have shown that, as a
TCM-based algorithm, HIGEDA can identify novel motifs.
5. Applications
5.1 Recovery of a gene expression signature
In this application, we apply our methods to recover the estrogen gene expression
signature from the CMAP dataset. The flowchart is shown in Figure 5-1.
[Flowchart: integrated gene expression data are clustered with fzGASCE (a genetic
algorithm using fzSC and fzBLE); fzSC (fuzzy subtractive clustering) seeds the cluster
candidates; competitors, potential offspring and immunized candidates pass through
generation selection until a fuzzy partition yields the final cluster solution.]
Figure 5-1: Gene expression signature prediction.
The fzGASCE, fzSC, fzBLE, fzPBD and RPPR methods were applied to
the estrogen-treatment datasets in CMAP with the estrogen gene expression signature
[85]. We first preprocessed the CMAP arrays using the RMA method to obtain 564
expression profiles. Using our meta-analysis method, RPPR [169], we selected the
expression profiles most strongly related to the estrogen expression signature
[25, 26]. Expression profiles from five experiments, #373, 988, 1021, 1079 and
1113, were selected. We then used the RankProduct method [30, 54] across the multiple
experiments to filter differentially expressed probes using a p-value cutoff of 0.005.
This process selected 204 probes.
The expression levels of these 204 probes across the selected profiles were
extracted, and fzGASCE clustering was applied using the Pearson correlation
distance. Four clusters were found. Each cluster, a set of probesets, is considered a
candidate estrogen gene expression signature. We applied RPPR to each cluster using
CMAP. Significantly related expression profiles for each cluster, with cutoffs of +0.5
and -0.5 for positive and negative scores respectively, are shown in Table 5.1.
Table 5.1: Clustering results with estrogen-significantly-associated drugs

Cluster (#probesets)   Associated drugs [CMAP experiment] (+/- hit score)
1 (60)   estradiol[373](+0.93), estradiol[1021](+0.85), estradiol[988](+0.73),
         estradiol[1079](+0.69), doxycycline[1113](+0.64), genistein[1073](+0.63),
         estradiol[365](+0.63), nordihydroguaiaretic acid[415](+0.59),
         alpha-estradiol[1048](+0.58), genistein[268](+0.57), estradiol[414](+0.57),
         genistein[1015](+0.56), dexamethasone[374](+0.54), estradiol[121](+0.53),
         alpha-estradiol[990](+0.52), genistein[267](+0.52), fulvestrant[1076](-0.75),
         fulvestrant[1043](-0.71), fulvestrant[985](-0.66), fulvestrant[367](-0.57),
         5224221[956](-0.55), 15-delta prostaglandin J2[1011](-0.54),
         5230742[862](-0.54), fulvestrant[310](-0.52)
2 (53)   doxycycline[1113](+0.85), estradiol[988](+0.74), estradiol[1021](-0.61),
         estradiol[373](-0.58), 12,13-EODE[1108](+0.56)
3 (44)   estradiol[988](+0.84), estradiol[1079](+0.80), estradiol[1021](+0.58),
         alpha-estradiol[990](+0.52)
4 (47)   estradiol[1021](+0.92), estradiol[988](+0.66), doxycycline[1113](+0.61),
         estradiol[1079](+0.58)
Table 5.2: Estrogen signature and the 'signature' predicted by fzGASCE
(columns: expression levels in the selected profiles 373, 988, 1021, 1079, 1113)

Affy id        373    988    1021   1079   1113   Predicted
202437_s_at    1.72   1.92   1.97   1.68   1.75   X
206115_at      1.29   1.29   1.28   1.40   1.54   X
209339_at      1.66   1.80   1.77   1.76   1.82   X
209687_at      1.51   1.66   1.59   1.63   1.65   X
210367_s_at    1.62   1.77   1.64   1.70   1.71   X
210735_s_at    1.71   1.73   1.71   1.81   1.81   X
211421_s_at    1.35   1.55   1.47   1.63   1.59   X
213906_at      1.29   1.48   1.45   1.43   1.62   X
215771_x_at    1.45   1.69   1.71   1.79   1.63   X
215867_x_at    1.69   1.82   1.77   1.70   1.82   X
205879_x_at    1.50   1.81   1.68   1.68   1.95   X
205862_at      1.69   1.81   1.69   1.69   1.72   X
203666_at      1.55   1.61   1.60   1.60   1.68   X
203963_at      1.71   1.88   1.71   1.85   1.81   X
204508_s_at    1.60   1.82   1.76   1.93   1.91   X
204595_s_at    1.78   1.80   1.62   1.75   1.99   X
205380_at      1.31   1.59   1.50   1.64   1.84   X
205590_at      1.69   1.71   1.87   1.82   1.79   X
217764_s_at    1.62   1.83   1.74   1.85   1.84   X
204597_x_at    1.70   1.74   1.75   2.02   1.69   -
205239_at      1.71   1.73   1.77   1.62   1.81   -
204497_at      1.79   1.76   1.70   1.79   1.84   -
Table 5.1 shows that cluster 1 strongly hit the estrogen expression profiles in
CMAP. It is also the only cluster with a significant negative correlation with fulvestrant,
an estrogen antagonist. This cluster is therefore the best candidate for the estrogen gene
expression signature. Interestingly, this cluster contains 19 of the 22 probesets from the
actual estrogen signature that were significantly regulated in the five selected profiles
(Table 5.2). We therefore conclude that fzGASCE successfully discovered the estrogen
gene expression signature from the CMAP datasets.
5.2 Genedrug association prediction
Our algorithmic methods can be combined into a solution model for the problem of
gene-drug association prediction. Given a set of genes that may describe, completely or
partially, the biological phenotype of interest, the problem is to search the CMAP
datasets for drug treatments that directly or indirectly alter their expression.
5.2.1 Model design
Let A be the given set of genes.
[Flowchart: gene set A is expanded to set B; the key gene set B_A is identified; a gene
expression signature is generated; gene-drug associations are predicted.]
Figure 5-2: Gene-drug association prediction.
Gene set expansion
[Flowchart: given A, a set of genes describing the biological phenotype, search for
in-vivo interactors of A and apply KEGG pathway annotation; select the genes having at
least one interaction partner annotated in KEGG; the result is the expanded set B,
A ⊂ B, containing the genes of A plus their KEGG-pathway-annotated interactors.]
Figure 5-3: Expansion of gene set A.
Because our knowledge of the molecular basis of the phenotype is limited, the set
A probably does not describe the phenotype completely. To address this problem, we
propose to expand the gene set using protein-protein interaction (PPI) and pathway
data. A gene g is added to set A if g interacts with at least one gene in set A, and g
and/or its interaction partner in A is annotated as a component of some pathway. These
criteria are chosen because, while physically interacting proteins are not necessarily
components of the same biological process, a pair of interacting proteins that is associated
with one pathway is more likely to participate in another pathway.
We first use PPI data from the Human Protein Reference Database (HPRD) to
expand set A. We then use KEGG pathway data to select the proteins that meet the
criteria. If a gene g, g ∈ A, is annotated as a component of any pathway, we include the
genes that interact with g, whether or not they are themselves annotated as pathway
components. If gene g is not a component of any pathway, its interaction partners are
included only if they are annotated as components of at least one pathway; these
included genes then suggest a potential novel role for g.

Let B be the expanded gene set. B contains the set A and the proteins that interact
with A and may mediate the association of proteins in A with some pathways. Every
pair of genes in B is annotated in at least one pathway. For example, consider a set of
three genes, a, b, c, and their interaction partners, d, e, f (Figure 5-3). The interaction
between genes f and b is not annotated in any KEGG pathway; therefore, f is not
selected. The expanded set B then contains the genes a, b, c, d and e.
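The expansion rule can be sketched as follows. The ppi and pathway_genes inputs are toy stand-ins for the HPRD interaction pairs and the KEGG-annotated gene set; the function illustrates the selection criteria, not the lab's actual pipeline.

```python
# Sketch of the gene-set expansion rule: a partner g of some a in A joins B
# when g and/or a is annotated to at least one pathway.
# ppi: set of frozenset pairs; pathway_genes: genes with >=1 pathway annotation.

def expand(A, ppi, pathway_genes):
    B = set(A)
    for pair in ppi:
        inside, outside = pair & A, pair - A
        if len(inside) == 1 and len(outside) == 1:
            a = next(iter(inside))
            g = next(iter(outside))
            # keep g if either endpoint of the interaction is pathway-annotated
            if a in pathway_genes or g in pathway_genes:
                B.add(g)
    return B
```

On the worked example from the text (A = {a, b, c}; interactors d, e, f; the f-b interaction unannotated), the function returns {a, b, c, d, e}, excluding f.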
The key genes
Set B may be large, and it is possible that not all of its members are involved in a
single pathway relevant to the biological phenotype. The next step, therefore, is to
search for the genes that are most likely to interact with the genes in A (Figure 5-4). We
call these the key genes associated with the phenotype. Key genes may be members of
A or genes that interact with members of A.
[Flowchart: annotate B using GO CC terms; create M_B, a CC-term-based distance
matrix for the genes in B; cluster B using M_B into subgroups B_1, B_2, ..., B_k; select
B^ from {B_i, i=1..k}, the subgroup closest to A under GO BP-based similarity; derive
B_A and create the expression signature S_A for B_A using random walks and
heuristics.]
Figure 5-4: Identification of key genes.
We cluster the genes in B into groups where genes of the same group most often
occupy the same cellular location, to facilitate interaction. We use the GO cellular
component (CC) annotations to measure the distances between genes in B, and then
apply our clustering methods to classify the genes in B into groups B_1, B_2, ..., B_k.
These subsets may be the set A itself, or may have a member that interacts with at least
one member of A and is located in the same cellular locations as that member of A.
Because these sets B_i, i = 1, ..., k, are not necessarily related to A in a biological
process, we search for B^, where B^ ∈ {B_i}, i = 1..k, and B^ is functionally closest to
A. We address this problem using a semantic similarity measure based on GO
Biological Process (BP) annotations: the selected set B^ should have the highest
semantic similarity to A. We define B_A = A ∪ B^, the key genes of the expanded set B
of set A. Since B_A was generated by examining known interactors of A that occupy
similar cellular locations and are annotated to pathways, B_A is believed to describe the
biological phenotype of interest better than the set A.
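The selection of B^ can be illustrated with a simplified similarity. Here a plain Jaccard overlap of pooled GO BP term sets stands in for the thesis's semantic similarity measure, which is hierarchy-aware; the names and data are hypothetical.

```python
# Toy sketch of choosing B^ as the subgroup functionally closest to A.
# bp maps each gene to its set of GO BP term IDs; the real measure is
# annotation-hierarchy aware, so this Jaccard overlap is only an illustration.

def go_similarity(genes1, genes2, bp):
    """Jaccard similarity between the pooled GO BP term sets of two gene sets."""
    t1 = set().union(*(bp.get(g, set()) for g in genes1))
    t2 = set().union(*(bp.get(g, set()) for g in genes2))
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0

def closest_subgroup(A, subgroups, bp):
    """Return the subgroup B_i with the highest similarity score to A."""
    return max(subgroups, key=lambda Bi: go_similarity(A, Bi, bp))
```

The key gene set would then be the union of A with the returned subgroup.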
Generate and apply the gene expression signature

In this step, we use B_A and the CMAP gene expression microarray data to
generate a gene expression signature for the biological phenotype. Because not all the
genes in B_A exhibit a change in all expression profiles in the dataset, the key issues in
this step are to select expression profiles and to determine the direction of change in
expression of the genes in B_A. We extract the fold-change (FC) levels of the genes in
B_A from all gene expression profiles in CMAP, compute p-values from FC to filter
differentially expressed genes in each expression profile using RankProduct [30], and
conduct cluster analysis to group all CMAP expression profiles over the genes in B_A
into p groups, {P_i}, i = 1..p. We choose P_A as the group of samples in which the
genes in B_A are the most changed, and use the selected genes to define the gene
expression signature, S_A, for the phenotype. The FC levels of the genes in S_A are
averaged across the expression profiles in P_A to determine the expression direction.
S_A is the gene expression signature we propose for the phenotype.
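The final labeling step can be sketched directly. A minimal illustration, assuming fold changes are given per gene across the selected profiles in P_A:

```python
from statistics import mean

# Sketch of the direction-labeling step: average each gene's fold change (FC)
# across the selected profiles and assign 'Down' when the average FC is
# below 1, 'Up' otherwise. The input dict is a toy stand-in for CMAP data.

def signature_direction(fc_by_gene):
    """fc_by_gene: {gene: [FC per selected profile]} -> {gene: 'Up' | 'Down'}"""
    return {g: ("Down" if mean(fcs) < 1.0 else "Up")
            for g, fcs in fc_by_gene.items()}
```

The resulting labels, together with the gene list, constitute the proposed signature S_A.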
[Flowchart: from the key genes, select from CMAP a set of expression profiles related
to S_A (CMAP_A); cluster CMAP_A into m gene clusters C_1, ..., C_m using gene
expression levels; select C_A from {C_i, i=1..m}, the cluster closest to A by expression
levels; eliminate C_A and the clusters of unchanged genes. Result 1: a set of
drugs/treatments that may target genes in A. Result 2: C_A provides additional genes
related to the phenotype. Result 3: C^c_A provides additional gene groups related to
the phenotype.]
Figure 5-5: Gene expression signature generation and application.
The predicted expression signature, S_A, is then used to identify a subset of
CMAP, CMAP_A, which contains the expression profiles strongly relevant to the
phenotype. The small molecules/drugs used as treatments in CMAP_A suggest potential
therapeutics to modulate expression of the given gene set. By conducting cluster
analysis on CMAP_A and using the GO BP-based semantic similarity measure, we can
also obtain the set C_A, which contains additional genes potentially related to the
biological phenotype of interest, and the set C^c_A of additional groups of genes that
changed, but differently than did the genes in the given set A.
5.2.2 Model testing
Intellectual disability (ID) in Down syndrome (DS) ranges from low normal to
severely impaired, and has a significant impact on the quality of life of the affected
individuals and their families. Development of pharmacotherapies for cognitive deficits
is, therefore, an important goal. The working hypothesis in DS research is that the
phenotypic features are caused by the modest 50% increases in the expression of
trisomic genes, and that these in turn result in modest perturbations of otherwise normal
cell functions and pathways. However, the primary difficulties in targeting therapeutics
to Hsa21q genes are determining the number of genes to be regulated simultaneously,
the number of drugs to be used, and the gene and/or drug interactions that might occur
in full trisomies to produce adverse effects [171]. This application helps identify the set
of drug treatments that may affect a given set of Hsa21q genes using the CMAP
datasets.
Hsa21 gene set 1
This gene set contains six Hsa21q genes: APP, DYRK1A, ITSN1, RCAN1,
S100B and TIAM1 [171] as shown in Table 5.3.
1) As in Figure 5-3, to generate the expanded gene set, we used the PPI and
pathway database from the Gardiner lab [167], which is compiled from
multiple bioinformatics resources [168] (available online at
http://gfuncpathdb.ucdenver.edu/iddrc/). We used the HPRD PPIs at the "in
vivo" level and the KEGG pathways to filter the genes. The expanded set B
of A is shown in Table 5.4.
Table 5.3: Hsa21 set-1 query genes

No.   Gene     Description                                                        Chr.
1     APP      amyloid beta (A4) precursor protein                                21
2     DYRK1A   dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A  21
3     ITSN1    intersectin 1 (SH3 domain protein)                                 21
4     RCAN1    regulator of calcineurin 1                                         21
5     S100B    S100 calcium binding protein B                                     21
6     TIAM1    T-cell lymphoma invasion and metastasis 1                          21
2) As in Figure 5-4, we conduct cluster analysis on B using the fzGASCE and
GO-fzBLE methods with GO CC annotations. Gene groups are then ranked
by how close they are to the given gene set A using a GO BP-based
semantic similarity measurement. Results are shown in Table 5.5. We
choose the first group, which has the highest semantic similarity score to the
set A, for the next steps.
Table 5.4: Expanded set of the Hsa21 set-1
(Selected proteins in bold are annotated in at least one pathway)

#   Selected protein   Interactors
1   APP       SERPINA3, ABL1, APBA1, APBA2, APBB1, APBB2, CALR, CASP3, CDK5,
              CHRNA7, CLU, CTSD, F12, GRB2, GSK3B, GSN, HSD17B10, HMOX2, KNG1,
              KLC1, LRP1, NF1, PIN1, PPID, PSEN2, SHC1, TGFB1, TGFB2, NUMB,
              NAE1, NUMBL, ITM2B, HOMER3, HOMER2, MAPK8IP1, APBA3, BCAP31,
              APBB3, KAT5, NCSTN, BACE2, LDLRAP1, HTRA2, LRP1B, SHC3, NECAB3,
              COL25A1
2   DYRK1A    CREB1, FOXO1, GLI1, YWHAB, YWHAE, YWHAG
3   ITSN1     DNM1, EPHB2, EPS15, HRAS, SNAP25, SOS1, SNAP23, WASL
4   RCAN1     MAP3K3, PPP3CA
5   S100B     GUCY2D, MAPT, IQGAP1
6   TIAM1     ANK1, CAMK2G, CD44, EFNB1, EPHA2, HRAS, MYC, NME1, PRKCA, RAC1,
              SRC, TIAM1, YWHAG, MAPK8IP1, MAPK8IP2, PARD3, PPP1R9B
Table 5.5: Gene clusters of the Hsa21 gene set-1 expanded set
(Genes in bold are in the original gene set A; chromosome in parentheses)

#   Genes                                                        Similarity score to A
1   APBA1(9), APBA2(15), APBB1(11), APBB2(4), APP(21), CHRNA7(15), CLU(8),
    CREB1(2), CTSD(11), DNM1(9), RCAN1(21), DYRK1A(21), EPHA2(1),
    EPHB2(1), KNG1(3), LRP1(12), MYC(8), NF1(17), PIN1(19), PPID(4), PSEN2(1),
    S100B(21), SNAP25(20), TGFB1(19), TGFB2(1), SNAP23(15), IQGAP1(15),
    NUMBL(19), HOMER3(19), KAT5(11), LRP1B(2), PPP1R9B(17)              0.32
2   SERPINA3(14), ABL1(9), ANK1(8), CALR(19), CAMK2G(10), CASP3(4),
    CD44(11), CDK5(7), EFNB1(X), EPS15(1), F12(5), FOXO1(13), GLI1(12),
    GRB2(17), GSK3B(3), GSN(9), GUCY2D(17), HSD17B10(X), HMOX2(16),
    HRAS(11), KLC1(14), MAPT(17), MAP3K3(17), NME1(17), PPP3CA(4),
    PRKCA(17), RAC1(7), ITSN1(21), SHC1(1), SOS1(2), SRC(20), TIAM1(21),
    YWHAB(20), YWHAE(17), YWHAG(7), NUMB(14), NAE1(16), WASL(7),
    ITM2B(13), HOMER2(15), MAPK8IP1(11), APBA3(19), BCAP31(X), APBB3(5),
    NCSTN(1), MAPK8IP2(22), BACE2(21), LDLRAP1(1), HTRA2(2), SHC3(9),
    PARD3(10), NECAB3(20), COL25A1(4)                                   0.26
3) As in Figure 5-5, we search for the probesets of the selected genes and
extract their expression levels. Differential expression significance is set
using a p-value cutoff of 0.005. We conduct cluster analysis of the
expression profiles using our clustering method, fzGASCE, with the
fuzzifier factor m set to 1.15. The expression profile clusters are ranked
based on the average fold change of the genes that are differentially
expressed, and the cluster with the highest expression level is selected.
4) We average the expression levels of the selected probesets (S_A) across the
selected profiles and assign the 'Down' label to probesets where the
averaged FC value is less than 1, and the 'Up' label otherwise. Results are
shown in Table 5.6. We propose these genes, with their direction of
expression change, as the expression signature for the phenotype most
related to the given set A of Hsa21 genes.
5) We use the gene expression signature in Table 5.6 with RPPR [169] (a
support tool is available online [168]) to select the list of drug treatments in
CMAP using a score cutoff of 0.8. Results are shown in Tables 5.7 and 5.8.
Our approach successfully identifies drugs commonly used in the treatment
of mental illness. Mebhydrolin is used to treat depression in older adults
with ID. Cytochalasin B causes filament shortening; however, Krucker et al.
[186] showed that the effect of cytochalasin B can be reversible, allowing
stable long-term potentiation in region CA1 of the hippocampus.
Geldanamycin can help with reducing brain injury, and levomepromazine is
used to relieve symptoms of schizophrenia. All of these drugs are listed in
Table 5.8.
Table 5.6: Proposed gene expression signature for the Hsa21 gene set-1
(67 genes, 147 probesets)

No  Probeset      Sign  Gene      Chr. |  No   Probeset      Sign  Gene      Chr.
1   202123_s_at   ↓     ABL1      9    |  75   206689_x_at   ↓     KAT5      11
2   207087_x_at   ↓     ANK1      8    |  76   209192_x_at   ↓     KAT5      11
3   208352_x_at   ↓     ANK1      8    |  77   214258_x_at   ↓     KAT5      11
4   205389_s_at   ↓     ANK1      8    |  78   212877_at     ↓     KLC1      14
5   208353_x_at   ↓     ANK1      8    |  79   212878_s_at   ↓     KLC1      14
6   205391_x_at   ↓     ANK1      8    |  80   213656_s_at   ↓     KLC1      14
7   205390_s_at   ↓     ANK1      8    |  81   217512_at     ↓     KNG1      3
8   206679_at     ↓     APBA1     9    |  82   206054_at     ↓     KNG1      3
9   215148_s_at   ↓     APBA3     19   |  83   57082_at      ↓     LDLRAP1   1
10  205146_x_at   ↓     APBA3     19   |  84   221790_s_at   ↓     LDLRAP1   1
11  202652_at     ↓     APBB1     11   |  85   203514_at     ↓     MAP3K3    17
12  204650_s_at   ↓     APBB3     5    |  86   213013_at     ↓     MAPK8IP1  11
13  211277_x_at   ↓     APP       21   |  87   213014_at     ↓     MAPK8IP1  11
14  200602_at     ↓     APP       21   |  88   205050_s_at   ↓     MAPK8IP2  22
15  214953_s_at   ↓     APP       21   |  89   208603_s_at   ↓     MAPK8IP2  22
16  217867_x_at   ↓     BACE2     21   |  90   206401_s_at   ↓     MAPT      17
17  212952_at     ↓     CALR      19   |  91   203928_x_at   ↓     MAPT      17
18  212953_x_at   ↓     CALR      19   |  92   203929_s_at   ↓     MAPT      17
19  214315_x_at   ↓     CALR      19   |  93   203930_s_at   ↓     MAPT      17
20  214316_x_at   ↓     CALR      19   |  94   202431_s_at   ↓     MYC       8
21  200935_at     ↓     CALR      19   |  95   202268_s_at   ↓     NAE1      16
22  212669_at     ↓     CAMK2G    10   |  96   208759_at     ↓     NCSTN     1
23  214322_at     ↓     CAMK2G    10   |  97   210720_s_at   ↓     NECAB3    20
24  212757_s_at   ↓     CAMK2G    10   |  98   207545_s_at   ↓     NUMB      14
25  202763_at     ↓     CASP3     4    |  99   209073_s_at   ↓     NUMB      14
26  204490_s_at   ↓     CD44      11   |  100  221280_s_at   ↓     PARD3     10
27  212014_x_at   ↓     CD44      11   |  101  210094_s_at   ↓     PARD3     10
28  209835_x_at   ↓     CD44      11   |  102  221526_x_at   ↓     PARD3     10
29  210916_s_at   ↓     CD44      11   |  103  221527_s_at   ↓     PARD3     10
30  212063_at     ↓     CD44      11   |  104  202927_at     ↓     PIN1      19
31  217523_at     ↓     CD44      11   |  105  204186_s_at   ↓     PPID      4
32  216056_at     ↓     CD44      11   |  106  204185_x_at   ↓     PPID      4
33  204489_s_at   ↓     CD44      11   |  107  202429_s_at   ↓     PPP3CA    4
34  204247_s_at   ↓     CDK5      7    |  108  202457_s_at   ↓     PPP3CA    4
35  210123_s_at   ↓     CHRNA7    15   |  109  202425_x_at   ↓     PPP3CA    4
36  208791_at     ↓     CLU       8    |  110  215194_at     ↓     PRKCA     17
37  208792_s_at   ↓     CLU       8    |  111  213093_at     ↓     PRKCA     17
38  222043_at     ↓     CLU       8    |  112  215195_at     ↓     PRKCA     17
39  214513_s_at   ↓     CREB1     2    |  113  206923_at     ↓     PRKCA     17
40  204312_x_at   ↓     CREB1     2    |  114  211373_s_at   ↓     PSEN2     1
41  204314_s_at   ↓     CREB1     2    |  115  204262_s_at   ↓     PSEN2     1
42  204313_s_at   ↓     CREB1     2    |  116  204261_s_at   ↓     PSEN2     1
43  200766_at     ↓     CTSD      11   |  117  208641_s_at   ↓     RAC1      7
44  209033_s_at   ↓     DYRK1A    21   |  118  209686_at     ↓     S100B     21
209
Table 5.6 (con't.)
No
Probeset
Sign
Gene
Chr.
No
Probeset
Sign
Gene
Chr.
45
211541 s at
i
DYRK1A
21
119
202376 at
i
SERPINA3
14
46
211079 s at
i
DYRK1A
21
120
214853 s at
i
SHC1
1
47
203499 at
i
EPHA2
1
121
201469 s at
i
SHC1
1
48
217887 s at
i
EPS 15
1
122
217048 at
i
SHC1
1
49
217886 at
i
EPS 15
1
123
206330 s at
i
SHC3
9
50
215961 at
i
F12
5
124
209131 s at
i
SNAP23
15
51
205774 at
i
F12
5
125
209130 at
i
SNAP23
15
52
202724 s at
i
FOXOl
13
126
214544 s at
i
SNAP23
15
53
202723 s at
i
FOXOl
13
127
202508 s at
i
SNAP25
20
54
206646 at
i
GLI1
12
128
202507 s at
i
SNAP25
20
55
215075 s at
i
GRB2
17
129
212777 at
i
SOS1
2
56
209945 s at
i
GSK3B
3
130
212780 at
i
SOS1
2
57
200696 s at
i
GSN
9
131
221284 s at
i
SRC
20
58
214040 s at
i
GSN
9
132
221281 at
i
SRC
20
59
218120 s at
i
HMOX2
16
133
213324 at
i
SRC
20
60
218121 at
i
HMOX2
16
134
203085 s at
i
TGFB1
19
61
217080 s at
i
HOMER2
15
135
203084 at
i
TGFB1
19
62
212983 at
i
HRAS
11
136
209908 s at
i
TGFB2
1
63
203089 s at
i
HTRA2
2
137
220407 s at
i
TGFB2
1
64
211152 s at
i
HTRA2
2
138
209909 s at
i
TGFB2
1
65
213446 s at
i
IQGAP1
15
139
206409 at
i
TIAM1
21
66
210840 s at
i
IQGAP1
15
140
213135 at
i
TIAM1
21
67
200791 s at
i
IQGAP1
15
141
205810 s at
i
WASL
7
68
217731 s at
i
ITM2B
13
142
205809 s at
i
WASL
7
69
217732 s at
i
ITM2B
13
143
217717 s at
i
YWHAB
20
70
209297 at
i
ITSN1
21
144
208743 s at
i
YWHAB
20
71
35776 at
i
ITSN1
21
145
210317 s at
i
YWHAE
17
72
207322 at
i
ITSN1
21
146
210996 s at
i
YWHAE
17
73
210713 at
i
ITSN1
21
147
213655 at
i
YWHAE
17
74
209298 s at
i
ITSN1
21
Table 5.7 : Drugs predicted to enhance expression of Hsa21 gene set 1

Drug (concentration) - Target or disease
Estradiol (10 nM) - Treatment of urogenital symptoms associated with
post-menopausal atrophy of the vagina and/or the lower urinary tract.
4,5-dianilinophthalimide (10 mM) - A potent inhibitor of amyloid beta
fibrillization.
Colforsin (50 mM) - A novel drug for the treatment of acute heart failure.
Fisetin (50 mM) - A compound that has been shown to alleviate ageing effects
in certain model organisms.
Table 5.8 : Drugs predicted to repress expression of Hsa21 gene set 1

Drug (concentration) - Target or disease
Mebhydrolin (4.8 mM) - Enhanced sedation with alcohol or other CNS
depressants; additive antimuscarinic effects with MAOIs, atropine and TCAs,
which are used in the treatment of major depressive disorder, panic disorder
and other anxiety disorders, increasing levels of norepinephrine and
serotonin that act as communication agents between different brain cells.
Cytochalasin B (20.8 mM) - A cell-permeable mycotoxin. It inhibits
cytoplasmic division by blocking the formation of contractile
microfilaments, inhibits cell movement and induces nuclear extrusion.
Geldanamycin (1 mM) - A benzoquinone ansamycin antibiotic that binds to
Hsp90 and inhibits its function. HSP90 proteins play important roles in the
regulation of the cell cycle, cell growth, cell survival, apoptosis,
angiogenesis and oncogenesis.
Tanespimycin (1 mM) - It helps cause the breakdown of certain proteins in
the cell, and may kill cancer cells. It is a type of antineoplastic
antibiotic and a type of HSP90 inhibitor.
Vorinostat (10 mM) - A drug used to treat cutaneous T-cell lymphoma that
does not get better, gets worse, or comes back during or after treatment
with other drugs. It is also being studied in the treatment of other types
of cancer. Vorinostat is a type of histone deacetylase inhibitor.
Trichostatin A (1 mM, 100 nM) - An organic compound that serves as an
antifungal antibiotic and selectively inhibits the class I and II mammalian
histone deacetylase families of enzymes, but not class III HDACs.
Levomepromazine (9 mM) - Levomepromazine is used in palliative care to help
ease distressing symptoms such as pain, restlessness, anxiety and sickness.
It works on chemical substances acting on the nervous system in the brain.
Levomepromazine is also used to relieve symptoms of schizophrenia, a mental
health condition that causes disordered ideas, beliefs and experiences.
Symptoms of schizophrenia include hearing, seeing, or sensing things that
are not real, having mistaken beliefs, and feeling unusually suspicious.
Hsa21 gene set 2

Given a set of Hsa21 genes: APP, DSCAM, ITSN1, PCP4, RCAN1, TIAM1.

Table 5.9 : Hsa21 set 2 query genes

No. | Gene | Description | Chr.
1 | APP | amyloid beta (A4) precursor protein | 21
2 | DSCAM | Down syndrome cell adhesion molecule | 21
3 | ITSN1 | intersectin 1 (SH3 domain protein) | 21
4 | PCP4 | Purkinje cell protein 4 | 21
5 | RCAN1 | regulator of calcineurin 1 | 21
6 | TIAM1 | T-cell lymphoma invasion and metastasis 1 | 21
Table 5.10 shows the predicted gene expression signature for Hsa21 gene set 2.
We apply the gene expression signature to search for drug treatments that may
affect genes in the given gene set using the CMAP datasets. A list of drug
treatments predicted to positively regulate genes in the given set is shown in
Table 5.11. Table 5.12 lists drugs predicted to negatively affect these genes.
These include geldanamycin, which can help reduce brain injury;
podophyllotoxin, which is used in the treatment of brain tumors; and
chlorpromazine, which is used in the treatment of various psychiatric
illnesses.
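The querying step described above can be sketched as follows. This is a minimal, illustrative rank-based score, not the exact CMAP connectivity statistic; the function name, the toy profile and the example signature entries are assumptions for illustration only.

```python
# Sketch: score one drug-treatment expression profile against a gene
# expression signature. A positive score means the drug moves signature
# probesets in the same direction as the signature; a negative score
# means it tends to reverse them.

def connectivity_score(signature, ranked_probesets):
    """signature: dict mapping probeset -> +1 (up) or -1 (down).
    ranked_probesets: probesets ordered from most up-regulated to most
    down-regulated in a drug treatment profile."""
    n = len(ranked_probesets)
    rank = {p: i for i, p in enumerate(ranked_probesets)}
    total, hits = 0.0, 0
    for probeset, sign in signature.items():
        if probeset in rank:
            # Map rank to [-1, 1]: top of the list -> +1, bottom -> -1.
            position = 1.0 - 2.0 * rank[probeset] / (n - 1)
            total += sign * position
            hits += 1
    return total / hits if hits else 0.0

# Toy drug profile (ordered) and a two-probeset "up" signature.
profile = ["202123_s_at", "200602_at", "212952_at", "205389_s_at"]
sig = {"202123_s_at": +1, "200602_at": +1}
print(connectivity_score(sig, profile))
```

Drugs whose profiles score above a positive cutoff would be reported as enhancers (Table 5.11) and those below a negative cutoff as repressors (Table 5.12), under this simplified scoring.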
Table 5.10 : Predicted gene expression signature for the Hsa21 gene set 2
(70 genes, 157 probesets)

No. | AffyID | Sign | Gene | Chr. || No. | AffyID | Sign | Gene | Chr.
1 | 202123_s_at | i | ABL1 | 9 || 80 | 210713_at | i | ITSN1 | 21
2 | 205390_s_at | i | ANK1 | 8 || 81 | 209192_x_at | i | KAT5 | 11
3 | 208352_x_at | i | ANK1 | 8 || 82 | 214258_x_at | i | KAT5 | 11
4 | 207087_x_at | i | ANK1 | 8 || 83 | 206689_x_at | i | KAT5 | 11
5 | 205389_s_at | i | ANK1 | 8 || 84 | 212877_at | i | KLC1 | 14
6 | 205391_x_at | i | ANK1 | 8 || 85 | 212878_s_at | i | KLC1 | 14
7 | 208353_x_at | i | ANK1 | 8 || 86 | 213656_s_at | i | KLC1 | 14
8 | 206679_at | i | APBA1 | 9 || 87 | 206054_at | i | KNG1 | 3
9 | 209870_s_at | i | APBA2 | 15 || 88 | 217512_at | i | KNG1 | 3
10 | 209871_s_at | i | APBA2 | 15 || 89 | 57082_at | i | LDLRAP1 | 1
11 | 215148_s_at | i | APBA3 | 19 || 90 | 221790_s_at | i | LDLRAP1 | 1
12 | 205146_x_at | i | APBA3 | 19 || 91 | 200784_s_at | i | LRP1 | 12
13 | 202652_at | i | APBB1 | 11 || 92 | 200785_s_at | i | LRP1 | 12
14 | 212972_x_at | i | APBB2 | 4 || 93 | 203514_at | i | MAP3K3 | 17
15 | 213419_at | i | APBB2 | 4 || 94 | 213013_at | i | MAPK8IP1 | 11
16 | 216750_at | i | APBB2 | 4 || 95 | 213014_at | i | MAPK8IP1 | 11
17 | 216747_at | i | APBB2 | 4 || 96 | 208603_s_at | i | MAPK8IP2 | 22
18 | 40148_at | i | APBB2 | 4 || 97 | 205050_s_at | i | MAPK8IP2 | 22
19 | 212970_at | i | APBB2 | 4 || 98 | 202431_s_at | i | MYC | 8
20 | 212985_at | i | APBB2 | 4 || 99 | 202268_s_at | i | NAE1 | 16
21 | 204650_s_at | i | APBB3 | 5 || 100 | 208759_at | i | NCSTN | 1
22 | 200602_at | i | APP | 21 || 101 | 210720_s_at | i | NECAB3 | 20
23 | 214953_s_at | i | APP | 21 || 102 | 211914_x_at | i | NF1 | 17
24 | 211277_x_at | i | APP | 21 || 103 | 212676_at | i | NF1 | 17
25 | 217867_x_at | i | BACE2 | 21 || 104 | 216115_at | i | NF1 | 17
26 | 200837_at | i | BCAP31 | X || 105 | 204325_s_at | i | NF1 | 17
27 | 200935_at | i | CALR | 19 || 106 | 211094_s_at | i | NF1 | 17
28 | 212952_at | i | CALR | 19 || 107 | 212678_at | i | NF1 | 17
29 | 214315_x_at | i | CALR | 19 || 108 | 210631_at | i | NF1 | 17
30 | 212953_x_at | i | CALR | 19 || 109 | 204323_x_at | i | NF1 | 17
31 | 214316_x_at | i | CALR | 19 || 110 | 211095_at | i | NF1 | 17
32 | 214322_at | i | CAMK2G | 10 || 111 | 207545_s_at | i | NUMB | 14
33 | 212669_at | i | CAMK2G | 10 || 112 | 209073_s_at | i | NUMB | 14
34 | 212757_s_at | i | CAMK2G | 10 || 113 | 209615_s_at | i | PAK1 | 11
35 | 202763_at | i | CASP3 | 4 || 114 | 221280_s_at | i | PARD3 | 10
36 | 212063_at | i | CD44 | 11 || 115 | 221527_s_at | i | PARD3 | 10
37 | 204489_s_at | i | CD44 | 11 || 116 | 221526_x_at | i | PARD3 | 10
38 | 210916_s_at | i | CD44 | 11 || 117 | 210094_s_at | i | PARD3 | 10
39 | 204490_s_at | i | CD44 | 11 || 118 | 202927_at | i | PIN1 | 19
40 | 212014_x_at | i | CD44 | 11 || 119 | 204186_s_at | i | PPID | 4
41 | 217523_at | i | CD44 | 11 || 120 | 204185_x_at | i | PPID | 4
42 | 209835_x_at | i | CD44 | 11 || 121 | 202429_s_at | i | PPP3CA | 4
43 | 216056_at | i | CD44 | 11 || 122 | 202425_x_at | i | PPP3CA | 4
44 | 204247_s_at | i | CDK5 | 7 || 123 | 202457_s_at | i | PPP3CA | 4
45 | 210123_s_at | i | CHRNA7 | 15 || 124 | 215195_at | i | PRKCA | 17
46 | 208792_s_at | i | CLU | 8 || 125 | 215194_at | i | PRKCA | 17
47 | 208791_at | i | CLU | 8 || 126 | 213093_at | i | PRKCA | 17
48 | 222043_at | i | CLU | 8 || 127 | 206923_at | i | PRKCA | 17
49 | 200766_at | i | CTSD | 11 || 128 | 204261_s_at | i | PSEN2 | 1
50 | 217341_at | i | DNM1 | 9 || 129 | 204262_s_at | i | PSEN2 | 1
51 | 215116_s_at | i | DNM1 | 9 || 130 | 211373_s_at | i | PSEN2 | 1
52 | 211484_s_at | i | DSCAM | 21 || 131 | 208641_s_at | i | RAC1 | 7
53 | 202711_at | i | EFNB1 | X || 132 | 208370_s_at | i | RCAN1 | 21
54 | 203499_at | i | EPHA2 | 1 || 133 | 215253_s_at | i | RCAN1 | 21
55 | 210651_s_at | i | EPHB2 | 1 || 134 | 202376_at | i | SERPINA3 | 14
56 | 211165_x_at | i | EPHB2 | 1 || 135 | 201469_s_at | i | SHC1 | 1
57 | 209589_s_at | i | EPHB2 | 1 || 136 | 214853_s_at | i | SHC1 | 1
58 | 209588_at | i | EPHB2 | 1 || 137 | 206330_s_at | i | SHC3 | 9
59 | 217887_s_at | i | EPS15 | 1 || 138 | 214544_s_at | i | SNAP23 | 15
60 | 217886_at | i | EPS15 | 1 || 139 | 209130_at | i | SNAP23 | 15
61 | 215961_at | i | F12 | 5 || 140 | 209131_s_at | i | SNAP23 | 15
62 | 205774_at | i | F12 | 5 || 141 | 202508_s_at | i | SNAP25 | 20
63 | 215075_s_at | i | GRB2 | 17 || 142 | 202507_s_at | i | SNAP25 | 20
64 | 209945_s_at | i | GSK3B | 3 || 143 | 212777_at | i | SOS1 | 2
65 | 200696_s_at | i | GSN | 9 || 144 | 212780_at | i | SOS1 | 2
66 | 214040_s_at | i | GSN | 9 || 145 | 221281_at | i | SRC | 20
67 | 218121_at | i | HMOX2 | 16 || 146 | 213324_at | i | SRC | 20
68 | 218120_s_at | i | HMOX2 | 16 || 147 | 221284_s_at | i | SRC | 20
69 | 217080_s_at | i | HOMER2 | 15 || 148 | 203084_at | i | TGFB1 | 19
70 | 212983_at | i | HRAS | 11 || 149 | 203085_s_at | i | TGFB1 | 19
71 | 202282_at | i | HSD17B10 | X || 150 | 220407_s_at | i | TGFB2 | 1
72 | 211152_s_at | i | HTRA2 | 2 || 151 | 209908_s_at | i | TGFB2 | 1
73 | 203089_s_at | i | HTRA2 | 2 || 152 | 220406_at | i | TGFB2 | 1
74 | 217731_s_at | i | ITM2B | 13 || 153 | 209909_s_at | i | TGFB2 | 1
75 | 217732_s_at | i | ITM2B | 13 || 154 | 213135_at | i | TIAM1 | 21
76 | 207322_at | i | ITSN1 | 21 || 155 | 206409_at | i | TIAM1 | 21
77 | 209298_s_at | i | ITSN1 | 21 || 156 | 205810_s_at | i | WASL | 7
78 | 209297_at | i | ITSN1 | 21 || 157 | 205809_s_at | i | WASL | 7
79 | 35776_at | i | ITSN1 | 21 ||
Table 5.11 : Drugs predicted to enhance expression of Hsa21 gene set 2

Drug (concentration) - Target or disease
Estradiol (10 nM) - Treatment of urogenital symptoms associated with
post-menopausal atrophy of the vagina and/or the lower urinary tract.
4,5-dianilinophthalimide (10 mM) - A potent inhibitor of amyloid beta
fibrillization.
Colforsin (50 mM) - A novel drug for the treatment of acute heart failure.
Table 5.12 : Drugs predicted to repress expression of Hsa21 gene set 2

Drug (concentration) - Target or disease
Trichostatin A (1 mM, 100 nM) - An organic compound that serves as an
antifungal antibiotic and selectively inhibits the class I and II mammalian
histone deacetylase families of enzymes, but not class III HDACs.
Geldanamycin (1 mM) - A benzoquinone ansamycin antibiotic that binds to
Hsp90 and inhibits its function. HSP90 proteins play important roles in the
regulation of the cell cycle, cell growth, cell survival, apoptosis,
angiogenesis and oncogenesis.
Podophyllotoxin (9.6 mM) - Used in the treatment of many cancers,
particularly small cell lung carcinoma, testicular cancer and brain tumors.
It arrests cell growth by inhibiting DNA topoisomerase II, which causes
double-strand breaks in DNA.
Cytochalasin B (20.8 mM) - A cell-permeable mycotoxin. It inhibits
cytoplasmic division by blocking the formation of contractile
microfilaments, inhibits cell movement and induces nuclear extrusion.
Propafenone (10.6 mM) - Used to prevent or treat serious irregular
heartbeats (arrhythmia). Belongs to a class of drugs called
antiarrhythmics.
Chlorpromazine (11.2 mM) - Used in the treatment of various psychiatric
illnesses and in the management of nausea and vomiting associated with
terminal illness.
Tanespimycin (1 mM) - It helps cause the breakdown of certain proteins in
the cell, and may kill cancer cells. It is a type of antineoplastic
antibiotic and a type of HSP90 inhibitor.
5.3 Drug target prediction
In this application, we search for the set of genes affected by a given set of
drug treatments as follows:
1) Relate the given set of drug treatments to drug treatments in the CMAP:
a. using DrugBank chemical similarity;
b. based on overlap of drug targets.
2) Extract expression profiles of the CMAP drug treatments related to the
given drug treatments.
3) Generate the gene expression signature as in Section 5.2.1.
4) Apply RPPR [169] to the gene expression signature to expand the selected
expression profiles.
5) Apply our clustering methods to search for the group of genes that are most
differentially expressed. This set of genes is the answer to the problem.
5.4 Application of HIGEDA to the prediction of regulatory motifs
HIGEDA is capable of discovering new motifs in a sequence set. In this
application, we use HIGEDA to discover transcription factor binding sites
(TFBS) and miRNA complementary binding sites (MBS) in the genomic and mRNA
sequences of genes within each cluster from Section 5.1. Figure 5-6
illustrates the proposed method.
[Figure 5-6 flowchart: integrated gene expression data and the biological
context of interest feed a cluster analysis that produces gene groups; gene
RefSeqs retrieved from the NCBI database are processed by HIGEDA and a motif
scanner to predict TFBS and MBS.]
Figure 5-6 : Transcription factor and microRNA binding site prediction using
HIGEDA
6. Conclusions and future work
Our studies of microarray analysis included meta-analysis methods, cluster
analysis of gene expression data, and motif finding in biological sequences.
Meta-analysis methods should provide better results in the analysis of
microarray data obtained from multiple studies because of their stronger
statistical basis and the reduction in bias introduced by any individual
study.
In cluster analysis, we showed that soft clustering methods are more
appropriate for gene expression data because they allow genes to belong to
multiple clusters, reflecting their participation in more than one biological
process. The FCM algorithm is a good choice because it avoids two problems
seen in the EM algorithm: slow convergence in regions of overlap between
adjacent clusters, and reliance on an assumed data distribution model, an
assumption violated by most microarray data, which do not follow a specific
distribution such as Gaussian or chi-squared. We also proved that, by using
the possibility-to-probability model conversion, we can apply
probability-based model selection to possibility-based models. We therefore
benefit from the power of statistical methods while still modeling the
natural data distributions effectively, thanks to the flexibility of
possibility-based models. The advantages of our approach have been
demonstrated in our novel methods for fuzzy cluster validation using either
gene expression data or GO annotations, missing data imputation, fuzzy
cluster defuzzification and cluster number detection.
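The core idea of the possibility-to-probability conversion can be illustrated as follows. The simple per-point normalization shown here is an illustrative assumption standing in for the dissertation's actual transformation; it only conveys why probability-based model selection becomes applicable afterwards.

```python
# Sketch: rescale possibilistic cluster memberships (which need not sum
# to one) into a probability distribution over clusters, so that
# probability-based model selection criteria can be applied.

def possibility_to_probability(memberships):
    """memberships: possibilistic memberships in [0, 1] for one data point,
    one value per cluster. Returns values rescaled to sum to 1."""
    total = sum(memberships)
    if total == 0:
        # No cluster claims the point: fall back to a uniform distribution.
        return [1.0 / len(memberships)] * len(memberships)
    return [m / total for m in memberships]

# A point with strong possibility for cluster 0 and weaker ones elsewhere.
print(possibility_to_probability([0.9, 0.6, 0.3]))
```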
For the motif-finding problem, we showed that a PWM-based model is most
appropriate. With the use of the hierarchical genetic algorithm, in which
patterns of subsequences in motifs can change simultaneously, the
motif-finding algorithm can avoid local optima. In addition, with the use of
a dynamic programming algorithm, gapped motifs can be discovered.
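The PWM model argued for above can be sketched in a few lines. This is a generic illustration of PWM construction and log-odds scoring, not HIGEDA's implementation; the pseudocount and the uniform background frequencies are assumptions for the example.

```python
# Sketch: build a position weight matrix (log-odds against a uniform
# background) from aligned binding sites, then score candidate sites.
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a PWM from aligned binding sites of equal length."""
    width = len(sites[0])
    background = 1.0 / len(alphabet)
    pwm = []
    for pos in range(width):
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of observing each base at this position vs. background.
        pwm.append({a: math.log2((counts[a] / total) / background)
                    for a in alphabet})
    return pwm

def score(pwm, seq):
    """Log-odds score of a candidate site under the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))

pwm = build_pwm(["TATA", "TATA", "TACA"])
print(score(pwm, "TATA") > score(pwm, "GGGG"))
```

A motif scanner would slide this score over a sequence and report windows above a threshold; the gapped-motif extension adds a dynamic programming alignment step on top of this scoring.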
In this dissertation, we have made the following contributions to
bioinformatics research:
1) Development of a meta-analysis method to effectively filter differentially
expressed genes using external metrics, information based on the specific
biological context, and prior knowledge of existing biological pathways, PPIs
and the Gene Ontology. The method is applicable to cases where data come from
different laboratories using different versions of different microarray
platforms, and should provide more biologically relevant results. We further
propose a novel use of the Connectivity Map datasets to predict gene-drug
associations.
2) Development of a novel clustering algorithm that combines our algorithms
for FCM parameter initialization, FCM missing data imputation,
defuzzification of the FCM fuzzy partition and fuzzy cluster evaluation with
the GA optimization algorithm. The algorithm is applicable to cluster
analysis of gene expression data without prior specification of the number of
clusters or assumptions about the data distributions. Our clustering method
successfully supported the specific applications described in Section 5.
3) Development of a novel motif-finding algorithm for the problem of gene
regulatory sequence prediction for groups of genes with similar expression
patterns under biological conditions of interest, demonstrated in Section
4.3.2.
4) Development of a machine learning based framework providing a complete
solution to the problem of microarray data analysis.
5) Development of a website [168] where the methods of this dissertation are
made available online. The website uses R as the computational backend so
that our methods can work seamlessly with other statistical, machine learning
and data mining methods available in R packages. We will update the website
with our machine learning based framework to allow microarray data analysis
online.
In future work, we will improve our methods with parallelization and
implement them as web applications to support gene-drug-disease association
prediction. These applications will run on High Performance Computing (HPC)
resources and be accessible online. In addition, we will adapt our methods to
other research domains, such as Magnetic Resonance Imaging (MRI) data
analysis, social network information retrieval, and cluster analysis of
websites and user data. For MRI data analysis, our clustering methods can
help with image segmentation, marker identification and cluster analysis.
For social network information retrieval, our clustering methods can help
identify groups of users with similar social behaviors, for example in
entertainment, buying and travelling.
REFERENCES
[I] Stekel D. (2003) Microarray Bioinformatics, Cambridge University Press.
[2] Kohane I.S., Kho A.T. and Butte AJ. (2003) Microarrays for an Integrative Genomics, MIT Press.
[3] Amaratunga D. and Cabrera J. (2004) Exploration and Analysis of DNA Microarray and Protein
Array Data, John Wiley & Sons, Inc., Hoboken, New Jersey.
[4] Ewens W. and Grant G. (2005) Statisical Methods in Bioinformatics: An Introduction 2ed,
Springer.
[5] Shewhart W.A. and Wilks S.S. (2009) Batch effects and noise in microarray experiments, Wiley.
[6] Barrett T., Suzek T.O., Troup D.B., Wilhite S.E., Ngau W.C., Ledoux P., Rudnev D., Lash A.E.,
Fujibuchi W. and Edgar R. (2005) NCBI GEO: mining millions of expression profiles—database
and tools. Nucleic Acids Res. Vol. 33, D562D566. doi: 10.1093/nar/gki022.
[7] Barrett T., Troup D.B., Wilhite S.E., Ledoux P., Rudnev D., Evangelista C, Kim I.F., Soboleva A.,
Tomashevsky M. and Edgar R. (2007) NCBI GEO: mining tens of millions of expression profiles 
database and tools update. Nucleic Acids Res. Vol. 35, D760D765. doi: 10.1093/nar/gkl887.
[8] Koziol A.J. (2011) Comments on the rank product method for analyzing replicated experiments,
Elsevier FEBS letters, Vol. 584, pp. 941944.
[9] Witten D. and Tibshirani R. (2007) A comparison of fold change and the tstatistic for microarray
data analysis, Technical report, Stanford University.
[10] Deng X., Xu J., Hui J. and Wang C. (2009) Probability fold change: A robust computational
approach for identifying differentially expressed gene lists, Computer methods and Programs in
Biomedicine, Elsevier, Vol. 93, pp. 124139.
[II] Warnat P., Eils R. and Brors B. (2005) Crossplatform analysis of cancer microarray data improves
gene expression based classification of phenotypes. BMC Bioinformatics, Vol. 6, pp. 265.
[12] Xu L., Geman D. and Winslow R. (2007) Largescale integration of cancer microarray data
identifies a robust common cancer signature. BMC Bioinformatics Vol. 8, pp. 275.
[13] Xu L., Tan A.C., Winslow R.L. and Geman D. (2008) Merging microarray data from separate
breast cancer studies provides a robust prognostic test. BMC Bioinformatics, Vol. 9, pp. 125.
221
[14] Ramaswamy S., Ross K.N., Lander E.S. and Golub T.R. (2003) A molecular signature of
metastasis in primary solid tumors. Nat Genet, Vol. 33, pp. 4954.
[15] Bloom G., Yang I.V., Boulware D., Kwong K.Y., Coppola D., Eschrich S., Quackenbush J. and
Yeatman T.J. (2004) MultiPlatform, MultiSite, MicroarrayBased Human Tumor Classification.
Am J. Pathol, Vol. 164, pp. 916.
[16] Shen R., Ghosh D. and Chinnaiyan A.M. (2004) Prognostic meta signature of breast cancer
developed by twostage mixture modeling of microarray data. BMC Genomics, Vol.5, pp. 94.
[17] Jiang H., Deng Y., Chen H.S., Tao L., Sha Q., Chen J., Tsai C.J. and Zhang S. (2004) Joint
analysis of two microarray geneexpression data sets to select lung adenocarcinoma marker genes.
BMC Bioinformatics, Vol 5, pp.81.
[18] Warnat P., Eils R. and Brors B. (2005) Crossplatform analysis of cancer microarray data improves
gene expression based classification of phenotypes. BMC Bioinformatics, pp. 265.
[19] Rao Y., Lee Y., Jarjoura D., Ruppert A.S., Liu C.G., Hsu J.C. and Hagan J.P. (2008) A
Comparison of Normalization Techniques for Microarray Data. Berkeley Electronic Press, Vol. 7.
[20] Vert G., Nemhauser J.L., Geldner N., Hong F. and Chory J. (2005) Molecular Mechanisms of
Steroid Hormone Signaling in Plants. The Annual Review of Cell and Developmental Biology.
[21] Cahan P., Rovegno F., Mooney D., Newman J.C, Laurent III G. and McCaffrey T. (2007) Meta
analysis of microarray results: challenges, opportunities, and recommendations for standardization.
Gene, Elsevier, pp. 121 8.
[22] Irizarry R.A., Warren D., Spencer F., Kim I.F., Biswal S., Frank B.C., Gabrielson E., Garcia J.G.,
Geoghegan J., Germino G., Griffin C, Hilmer S.C., Hoffman E., Jedlicka A.E., Kawasaki E.,
MartinezMurillo F., Morsberger L., Lee H., Petersen D., Quackenbush J., Scott A., Wilson M.,
Yang Y., Ye S.Q. and Yu W. (2005) Multiplelaboratory comparison of microarray platforms. Nat
Methods, Vol. 2, pp.239243.
[23] Ramasamy A., Mondry A., Holmes C.C. and Altman D.G. (2008) Key Issues in Conducting a
MetaAnalysis of Gene Expression Microarray Datasets, PLoS, Vol. 5, pp. 184196.
222
[24] Warnat P., Eils R. and Brors B. (2006) Crossplatform analysis of cancer microarray data improves
gene expression based classification of phenotypes. BMC Bioinformatics. Vol. 6, pp. 265.
[25] Lamb J., Crawford E.D., Peck D., Modell J.W., Blat I.C., Wrobel M.J., Lerner J., Brunet J.P.,
Subramanian A., Ross K.N., Reich M., Hieronymus H., Wei G., Armstrong S.A., Haggarty S.J.,
demons P.A., Wei R., Carr S.A., Lander E.S. and Golub T.R. (2006) The Connectivity Map:
Using GeneExpression Signatures to Connect Small Molecules, Genes, and Disease. Science.
Vol. 313, pp. 192935.
[26] Zhang S.D., Gant T.W. (2008) A simple and robust method for connecting smallmolecule drugs
using geneexpression signatures. BMC Bioinformatics, Vol. 9, pp. 258264.
[27] Xu L. and Jordan M.I. (1996) On convergence properties of the EM algorithm for Gaussian
Mixtures. Neural Computation, Vol. 8, pp. 4091431.
[28] Ma S. and Dai Y. (2011) Principal component analysis based methods in bioinformatics studies,
Bioinformatics, Vol. 10, pp. 19.
[29] Breitling R., Armengaud P., Amtmann A. and Herzyk P. (2004) Rank products: a simple, yet
powerful, new method to detect differentially regulated genes in replicated microarray
experiments. FEBS Letters, Vol. 573, pp. 8392.
[30] Hong F., Breitling R., McEntee C.W., Wittner B.S., Nemhauser J.L. and Chory J. (2006)
RankProd: a bioconductor package for detecting differentially expressed genes in metaanalysis,
Bioinformatics, Vol. 22, pp. 28252827.
[31] Zhang N.R., Wildermuth M.C. and Speed T.P. (2008) Transcription Factor Binding site prediction
with multivariate gene expression data, Annual applied statistic, Vol. 2, pp. 332365.
[32] Alexiou P., Maragkakis M., Papadopoulos G.L., Simmosis V.A., Zhang L. and Hatzigeorgiou A.G.
(2010) The DIANAmirExTra Web Server: From Gene Expression Data to MicroRNA Function,
PLOSONE,Vol. 5,e.9172.
[33] Das M.K. and Dai H.K. (2007) A survey of DNA motif finding algorithms, BMC Bioinformatics,
Vol. 8, S21.
223
[34] Bailey T.L. and Elkan C. (1995). The value of prior knowledge in discovering motifs with MEME.
Proc. Intl. Conf. Intel. Syst. Mol. Biol., 3, 2129.
[35] Kel A.E., KelMargoulis O.V., Farnham P.J., Bartley S.M., Wingender E. and Zhang M.Q. (2001)
Computerassisted identification of cell cyclerelated genes: new targets for E2F transcription
factors. Journal of Molicular Biology, Vol. 25, pp. 99120.
[36] Rhodes D.R., Barrette T.R., Rubin M.A., Ghosh D. and Chinnaiyan A.M. (2002) MetaAnalysis of
Microarrays: Interstudy Validation of Gene Expression Profiles Reveals Pathway Dysregulation in
Prostate Cancer. Cancer Res, Vol. 62, pp. 44274433.
[37] Choi J.K, Yu U., Kim S. and Yoo O.J. (2003) Combining multiple microarray studies and
modeling inter study variation. Bioinformatics, Vol. 19, pp. 8490.
[38] Choi J.K., Choi J.Y., Kim D.G., Choi D.W., Kim B.Y., Lee K.H., Yeom Y.I., Yoo H.S., Yoo O.J.
and Kim S. (2004) Integrative analysis of multiple gene expression profiles applied to liver cancer
study. FEBS Letters, Vol. 565, pp. 93100.
[39] Hu P., Greenwood CM. and Beyene J. (2005) Integrative analysis of multiple gene expression
profiles with quality adjusted effect size models. BMC Bioinformatics, Vol. 6, pp. 128.
[40] Marot G. and Mayer CD. (2009) Sequential Analysis for Microarray Data Based on Sensitivity
and MetaAnalysis. Berkeley Electronic Press, Vol. 8.
[41] Strimmer K. and OpgenRhein R. (2007) Accurate Ranking of Differentially Expressed Genes by a
DistributionFree Shrinkage Approach. Berkeley Electronic Press, Vol. 6.
[42] Lin S. (2010) Space Oriented RankBased Data Integration. Berkeley Electronic Press, Vol.9.
[43] Parkhomenko E., Tritchler D. and Beyene J. (2009) Sparse Canonical Correlation Analysis with
Application to Genomic Data Integration. Berkeley Electronic Press, Vol.8.
[44] Su Z., Hong H., Fang H., Shi L., Perkins R. and Tong W. (2008) Very Important Pool (VIP) genes
 an application for microarraybased molecular signatures. BMC Bioinformatics, Vol. 9, S9.
[45] Rossell D., Guerra R., Scott C (2008) SemiParametric Differential Expression Analysis via
Partial Mixture Estimation. Berkeley Electronic Press, Vol. 7.
224
[46] Zhang S. (2006) An Improved Nonparametric Approach for Detecting Differentially Expressed
Genes with Replicated Microarray Data. Berkeley Electronic Press, Vol. 5.
[47] Tan F., Fu X., Zhang Y. and Bourgeois A.G. (2006) Improving Feature Subset Selection Using a
Genetic Algorithm for Microarray Gene Expression Data. IEEE Congress on Evolutionary
Computation.
[48] Choi J.K., Choi J.Y., Kim D.G., Choi D.W., Kim B.Y., Lee K.H., Yeom Y.I., Yoo H.S., Yoo O.J.
and Kim S. (2004) Integrative analysis of multiple gene expression profiles applied to liver cancer
study. FEBS Letter, Vol. 565, pp. 93100.
[49] Huttenhower C, Hibbs M., Myers C. and Troyanskaya O.G. (2006) A scalable method for
integration and functional analysis of multiple microarray datasets. Bioinformatics, Vol. 1, pp.
28902896.
[50] RodriguezZas S.L., Ko Y., Adams H.A. and Southey B.R. (2008) Advancing the understanding of
the embryo transcriptome coregulation using meta, functional, and gene network analysis tools.
Reproduction. Vol. 135, pp. 213236
[51] Burguillo F.J., Martin J., Barrera I. and Bardsley W.G. (2010) Metaanalysis of microarray data:
The case of imatinib resistance in chronic myelogenous leukemia. Comput Biol Chem, Vol. 34(3),
pp. 184192.
[52] Kadota K., Nakai Y. and Shimizu K. (2009) Ranking differentially expressed genes from
Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity.
Algorithms Mol Biol. Vol. 22.
[53] Xu L., Tan A.C., Winslow R.L. and Geman D. (2008) Merging microarray data from separate
breast cancer studies provides a robust prognostic test. BMC Bioinformatics, Vol. 27, pp. 9.
[54] Hong F. and Breitling R. (2008) A comparison of metaanalysis methods for detecting
differentially expressed genes in microarray experiments. Bioinfomatics, Vol. 24, pp. 374382.
[55] Yuan T., Guixia L., Ming Z., Yi Z. and Chunguang Z. (2010) In Proc. of the 2010 2nd
International on Signal Processing Systems (ICSPS), Vol. 3, pp. 484488.
225
[56] Xu L., Geman D. and Winslow R.L. (2007) Largescale integration of cancer microarray data
identifies a robust common cancer signature. BMC Bioinformatics, Vol. 8, pp. 275+.
[57] Romdhane L.B., Ayeb B. and Wang S. (2002) On computing the fuzzifier in iFLVQ: A data
driven approach, International Journal of Neural Systems, Vol. 12, pp. 149157.
[58] Bezdek J.C., Pal L.R. and Hathaway R.J. (1996) Sequential competitive learning and the Fuzzy C
means clustering algorithms. Neural Networks, Vol. 9, pp. 787796.
[59] Bezdek J.C. and Pal N.R. (1995) Two soft relatives of learning vector quantization. Neural
Networks, Vol. 8, pp. 729743.
[60] Dembele D. and Kastner P. (2003) Fuzzy Cmeans method for clustering microarray data.
Bioinformatics, Vol. 19, pp. 973980.
[61] Schwammle V. and Jensen O.N. (2010) A simple and fast method to determine the parameters for
fuzzy CMeans cluster analysis, Bioinformatics, Vol. 26, pp. 28412848.
[62] Zhao Q., Li Q. and Xing S. (2010) FCM Algorithm Based on the Optimization Parameters of
Objective Function Point, Proc. of 2010 International Conference on Computing, Control and
Industrial Engineering, Vol. 2.
[63] Chiu S.L. (1994) Fuzzy Model Identification Based on Cluster Estimation. Journal of Intelligent
and Fuzzy Systems, Vol. 2, pp. 267278.
[64] Li J., Chu C.H., Wang Y. and Yan W. (2002) An Improved Fuzzy Cmeans Algorithm for
Manufacturing Cell Formation, IEEE.
[65] Liu W., Xiao C.J., Wang B.W., Shi Y. and Fang S.F. (2003) Study On Combining Subtractive
Clustering With Fuzzy CMeans Clustering, In Proc. of the Second International Conference on
Machine Learning and Cybernetics.
[66] Yang M.S. and Wu A.K.L. (2005) A modified mountain clustering algorithm, Pattern Anal Applic,
Vol. 8, pp. 125138.
[67] Yang Q., Zhang D. and Tian F. (2010) An initialization method for fuzzy cmeans algorithm using
subtractive clustering. In Proc. of 2010 Third International Conference on Intelligent Networks and
Intelligent Systems.
226
[68] Chen J.Y., Qin Z. and Jia J. (2008) A weighted mean subtractive clustering algorithm, Information Technology Journal, Vol. 7, pp. 356-360.
[69] Collins F.S., Green E.D., Guttmacher A.E. and Guyer M.S. (2003) A vision for the future of genomics research, Nature, Vol. 422, pp. 835-847.
[70] Bailey T.L. and Elkan C. (1995) The value of prior knowledge in discovering motifs with MEME, Proc. Intl. Conf. Intel. Syst. Mol. Biol., Vol. 3, pp. 21-29.
[71] Bi C. (2007) A genetic-based EM motif finding algorithm for biological sequence analysis, Proc. IEEE Symp. Comput. Intel. Bioinfo. Comput. Biol., pp. 275-282.
[72] Chang X., Zhou W., Zhou C. and Liang Y. (2006) Prediction of transcription factor binding sites using genetic algorithm, 1st Conf. Ind. Elec. Apps., pp. 1-4.
[73] Frith M.C., Saunders N.F., Kobe B. and Bailey T.L. (2008) Discovering sequence motifs with arbitrary insertions and deletions, PLoS Comput. Biol., Vol. 4, e1000071.
[74] Henikoff J.G. and Henikoff S. (1996) Using substitution probabilities to improve position-specific scoring matrices, Comp. App. Biosci., Vol. 12, pp. 135-143.
[75] Hong T.P. and Wu M.T. (2008) A hierarchical gene-set genetic algorithm, J. Comp., Vol. 3, pp. 67-75.
[76] Li L., Liang Y. and Bass R.L. (2007) GAPWM: a genetic algorithm method for optimizing a position weight matrix, Bioinformatics, Vol. 23, pp. 1188-1194.
[77] Li L., Bass R.L. and Liang Y. (2008) fdrMotif: identifying cis-elements by an EM algorithm coupled with false discovery rate control, Bioinformatics, Vol. 24, pp. 629-636.
[78] Liu D., Xiong X., Dasgupta B. and Zhang H. (2006) Motif discoveries in unaligned molecular sequences using self-organizing neural networks, IEEE Trans. Neural Networks, Vol. 17, pp. 919-928.
[79] Nowakowski S. and Tiuryn J. (2007) A new approach to the assessment of the quality of predictions of transcription factor binding sites, J. Biomed. Info., Vol. 40, pp. 139-149.
[80] Osada R., Zaslavsky E. and Singh M. (2004) Comparative analysis of methods for representing and searching for transcription factor binding sites, Bioinformatics, Vol. 20, pp. 3516-3525.
[81] Pisanti N., Crochemore M., Grossi R. and Sagot M.F. (2005) Bases of motifs for generating repeated patterns with wildcards, IEEE/ACM Trans. Comput. Biol. and Bioinfo., Vol. 2, pp. 40-50.
[82] Touzet H. and Varre J.S. (2007) Efficient and accurate P-value computation for position weight matrices, Algorithms for Mol. Biol., Vol. 2, pp. 15-26.
[83] Wei Z. and Jensen S.T. (2006) GAME: Detecting cis-regulatory elements using a genetic algorithm, Bioinformatics, Vol. 22, pp. 1577-1584.
[84] Frank A. and Asuncion A. (2010) UCI Machine Learning Repository [Online], http://archive.ics.uci.edu/ml.
[85] Cho R.J., Campbell M.J., Winzeler E.A., Steinmetz L., Conway A., Wodicka L., Wolfsberg T.G., Gabrielian A.E., Landsman D., Lockhart D.J. and Davis R.W. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle, Molecular Cell, Vol. 2, pp. 65-73.
[86] Mewes H.W., Frishman D., Güldener U., Mannhaupt G., Mayer K., Mokrejs M., Morgenstern B., Münsterkötter M., Rudd S. and Weil B. (1999) MIPS: a database for protein sequences and complete genomes, Nucleic Acids Research, Vol. 27, pp. 44-48.
[87] Wen X., Fuhrman S., Michaels G.S., Carr D.B., Smith S., Barker J.L. and Somogyi R. (1998) Large-scale temporal gene expression mapping of central nervous system development, Proceedings of the National Academy of Sciences USA, Vol. 95, pp. 334-339.
[88] Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T. and Tibshirani R. (2001) Missing value estimation methods for DNA microarrays, Bioinformatics, Vol. 17, pp. 520-525.
[89] Shi F., Abraham G., Leckie C., Haviv I. and Kowalczyk A. (2011) Meta-analysis of gene expression microarrays with missing replicates, BMC Bioinformatics, Vol. 12, pp. 84-99.
[90] Pedro J., Curado J. and Church G.M. (2009) Meta-analysis of age-related gene expression profiles identifies common signatures of aging, Bioinformatics, Vol. 25, pp. 875-881.
[91] Zhang Z., Chen D. and Fenstermacher D. (2007) Integrated analysis of independent gene expression microarray datasets improves the predictability of breast cancer outcome, BMC Genomics, Vol. 8, pp. 331-343.
[92] Bezdek J.C. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York.
[93] Fukuyama Y. and Sugeno M. (1989) A new method of choosing the number of clusters for the fuzzy c-means method, In Proc. of the Fifth Fuzzy Systems Symp., pp. 247-250.
[94] Xie X.L. and Beni G. (1991) A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 13, pp. 841-847.
[95] Rezaee M.R., Lelieveldt B.P.F. and Reiber J.H.C. (1998) A new cluster validity index for the fuzzy c-mean, Pattern Recognition Letters, Vol. 19, pp. 237-246.
[96] Pakhira M.K., Bandyopadhyay S. and Maulik U. (2004) Validity index for crisp and fuzzy clusters, Pattern Recognition, Vol. 37, pp. 481-501.
[97] Rezaee B. (2010) A cluster validity index for fuzzy clustering, Fuzzy Sets and Systems, Vol. 161, pp. 3014-3025.
[98] Florea M.C., Jousselme A.L., Grenier D. and Bosse E. (2008) Approximation techniques for the transformation of fuzzy sets into random sets, Fuzzy Sets and Systems, Vol. 159, pp. 270-288.
[99] Xu L. and Jordan M.I. (1996) On convergence properties of the EM algorithm for Gaussian mixtures, Neural Computation, Vol. 8, pp. 129-151.
[100] Somogyi R., Wen X., Ma W. and Barker J.L. (1995) Developmental kinetics of GAD family mRNAs parallel neurogenesis in the rat, Journal of Neuroscience, Vol. 15, pp. 2575-2591.
[101] Wen X., Fuhrman S., Michaels G.S., Carr D.B., Smith S., Barker J.L. and Somogyi R. (1998) Large-scale temporal gene expression mapping of central nervous system development, In Proc. of the National Academy of Sciences USA, Vol. 95, pp. 334-339.
[102] Yeung K.Y., Haynor D.R. and Ruzzo W. (2001) Validating clustering for gene expression data, Bioinformatics, Vol. 17, pp. 309-318.
[103] Yeung K.Y., Fraley C., Murua A., Raftery A.E. and Ruzzo W.L. (2001) Model-based clustering and data transformations for gene expression data, Bioinformatics, Vol. 17, pp. 977-987.
[104] Mewes H.W., Hani J., Pfeiffer F. and Frishman D. (1998) MIPS: A database for protein sequences and complete genomes, Nucleic Acids Research, Vol. 26, pp. 33-37.
[105] Liu W.Y., Xiao C.J., Wang B.W., Shi Y. and Fang S.F. (2003) Study on Combining Subtractive Clustering with Fuzzy C-Means Clustering, In Proc. of Machine Learning and Cybernetics, International Conference, Xi'an, pp. 2659-2662.
[106] Yang M.S. and Wu K.L. (2005) A modified mountain clustering algorithm, Pattern Anal Applic, Vol. 8, pp. 125-138.
[107] Chen J.Y., Qin Z. and Jia J. (2008) A weighted mean subtractive clustering algorithm, Information Technology Journal, Vol. 7, pp. 356-360.
[108] Tuan C.C., Lee J.H. and Chao S.J. (2009) Using Nearest Neighbor Method and Subtractive Clustering-Based Method on Antenna-Array Selection Used in Virtual MIMO in Wireless Sensor Network, In Proc. of the Tenth International Conference on Mobile Data Management: Systems, Services and Middleware, Taipei, pp. 496-501.
[109] Collazo J.C., Aceves F.M., Gorrostieta E.H., Pedraza J.O., Sotomayor A.O. and Delgado M.R. (2010) Comparison between Fuzzy C-means clustering and Fuzzy Clustering Subtractive in urban air pollution, In Proc. of Electronics, Communications and Computer (CONIELECOMP), 2010 20th International Conference, Cholula, pp. 174-179.
[110] Yang Q., Zhang D. and Tian F. (2010) An initialization method for Fuzzy C-means algorithm using Subtractive Clustering, Third International Conference on Intelligent Networks and Intelligent Systems, Vol. 10, pp. 393-396.
[111] Li J., Chu C.H., Wang Y. and Yan W. (2002) An Improved Fuzzy C-means Algorithm for Manufacturing Cell Formation, Fuzzy Systems, Vol. 2, pp. 1505-1510.
[112] Loquin K. and Strauss O. (2008) Histogram density estimators based upon a fuzzy partition, Statistics and Probability Letters, Vol. 78, pp. 1863-1868.
[113] Houte B.P.P. and Heringa J. (2010) Accurate confidence aware clustering of array CGH tumor profiles, Bioinformatics, Vol. 26, pp. 6-14.
[114] Klinge C.M. (2001) Estrogen receptor interaction with estrogen response elements, Nucleic Acids Research, Vol. 29, pp. 2905-2919.
[115] Salama R.A. and Stekel D.J. (2010) Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction, Nucleic Acids Research, Vol. 38, p. e135.
[116] Wang H., He X., Band M., Wilson C. and Liu L. (2005) A study of inter-lab and inter-platform agreement of DNA microarray data, BMC Genomics, Vol. 6, p. 71.
[117] Fisher R.A. (1925) Statistical Methods for Research Workers, 13th ed., Oliver & Boyd, London.
[118] Tavazoie S., Hughes J.D., Campbell M.J., Cho R.J. and Church G.M. (1999) Systematic determination of genetic network architecture, Nat. Genet., Vol. 22, pp. 281-285.
[119] Dayangac-Erden D., Bora G., Ayhan P., Kocaefe Ç., Dalkara S., Yelekci K., Demir A.S. and Erdem-Yurter H. (2009) Histone Deacetylase Inhibition Activity and Molecular Docking of (E)-Resveratrol: Its Therapeutic Potential in Spinal Muscular Atrophy, Chemical Biology & Drug Design, Vol. 73, pp. 355-364.
[120] Faumont N., Durand-Panteix S., Schlee M., Gromminger S., Schuhmacher M., Holzel M., Laux G., Mailhammer R., Rosenwald A., Staudt L.M., Bornkamm G.W. and Feuillard J. (2009) c-Myc and Rel/NF-kappaB are the two master transcriptional systems activated in the latency III program of Epstein-Barr virus-immortalized B cells, Journal of Virology, Vol. 83, pp. 5014-5027.
[121] Dunn J.C. (1974) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybernet., Vol. 3, pp. 32-57.
[122] Höppner F., Klawonn F., Kruse R. and Runkler T. (2000) Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, John Wiley and Sons Ltd.
[123] Dunn J. (1974) Well-separated clusters and optimal fuzzy partitions, Journal of Cybernetics, Vol. 4, pp. 95-104.
[124] Wang W. and Zhang Y. (2007) On fuzzy cluster validity indices, Fuzzy Sets and Systems, Vol. 158, pp. 2095-2117.
[125] Gath I. and Geva A.B. (1989) Unsupervised Optimal Fuzzy Clustering, Pattern Analysis and Machine Intelligence, Vol. 11, pp. 773-781.
[126] Krishnapuram R. and Freg C.P. (1992) Fitting an unknown number of lines and planes to image data through compatible cluster merging, Pattern Recognition, Vol. 25, pp. 385-400.
[127] Kim Y., Kim D., Lee D. and Lee K.H. (2004) A cluster validation index for GK cluster analysis based on relative degree of sharing, Information Sciences, Vol. 168, pp. 225-242.
[128] Kaymak U. and Babuška R. (1995) Compatible cluster merging for fuzzy modeling, In Proc. IEEE Int. Conf. Fuzzy Systems, Yokohama, Japan, Mar. 1995, pp. 897-904.
[129] Tasdemir K. and Merenyi E. (2011) Exploiting Data Topology in Visualization and Clustering of Self-Organizing Maps, Neural Networks, Vol. 20, pp. 549-562.
[130] Kaymak U. and Setnes M. (2002) Fuzzy Clustering With Volume Prototypes and Adaptive Cluster Merging, Fuzzy Systems, Vol. 10, pp. 705-711.
[131] Iam-on N., Boongoen T. and Garrett S. (2010) LCE: a link-based cluster ensemble method for improved gene expression data analysis, Bioinformatics, Vol. 26, pp. 1513-1519.
[132] Valente de Oliveira J. and Pedrycz W. (2007) Advances in Fuzzy Clustering and its Applications, John Wiley & Sons Ltd.
[133] Bezdek J.C. (1980) A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms, Pattern Analysis and Machine Intelligence, Vol. 2, pp. 1-8.
[134] Bezdek J.C., Hathaway R.J., Sabin M.J. and Tucker W.T. (1987) Convergence Theory for Fuzzy c-Means: Counterexamples and Repairs, Systems, Man and Cybernetics, Vol. 17, pp. 873-877.
[135] Cohen J., Cohen P., West S.G. and Aiken L.S. (2002) Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd ed., Psychology Press.
[136] Hedges L.V. and Olkin I. (1985) Statistical Methods for Meta-Analysis, Orlando: Academic Press.
[137] Li D., Zhong C. and Zhang L. (2010) Fuzzy c-means clustering of partially missing data sets based on statistical representation, In Proc. of Intl' Conf. on Fuzzy Systems and Knowledge Discovery, pp. 460-464, Dalian, China, Aug 2010.
[138] Te-Shun C., Yen K.K., Liwei A., Pissinou N. and Makki K. (2007) Fuzzy belief pattern classification of incomplete data, In Proc. Intl' Conf. on Systems, Man and Cybernetics, pp. 535-540, Oct 2007, Montreal, Canada.
[139] Garcia-Laencina P.J., Sancho-Gomez J.L. and Figueiras-Vidal A.R. (2010) Pattern classification with missing data: a review, Neural Computing & Applications, Vol. 19, pp. 263-282.
[140] Hathaway R.J. and Bezdek J.C. (2001) Fuzzy c-Means Clustering of Incomplete Data, Systems, Man and Cybernetics, Vol. 31, pp. 735-744.
[141] Luo J.W., Yang T. and Wang Y. (2005) Missing Value Estimation for Microarray Data Based on Fuzzy C-means Clustering, In Proc. Intl' Conf. on High-Performance Computing, Changsha, China, Dec 2003.
[142] Mohammadi A. and Saraee M.H. (2008) Estimating Missing Value in Microarray Data Using Fuzzy Clustering and Gene Ontology, In Proc. Intl' Conf. on Bioinformatics and Biomedicine, pp. 382-385, Washington, DC, USA.
[143] Himmelspach L. and Conrad S. (2010) Clustering approaches for data with missing values: Comparison and evaluation, In Proc. of Intl' Conf. on Digital Information Management, pp. 19-28, Thunder Bay, ON.
[144] Wu K.L. (2010) Parameter Selections of Fuzzy C-Means Based on Robust Analysis, Engineering and Technology, Vol. 65, pp. 554-557.
[145] Le T., Altman T. and Gardiner J.K. (2010) HIGEDA: a hierarchical gene-set genetics based algorithm for finding subtle motifs in biological sequences, Bioinformatics, Vol. 26, pp. 302-309.
[146] Le T. and Altman T. (2011) A new initialization method for the Fuzzy C-Means Algorithm using Fuzzy Subtractive Clustering, In Proc. of the 2011 International Conference on Information and Knowledge Engineering, pp. 144-150, Las Vegas, Nevada, USA.
[147] Le T. and Gardiner J.K. (2011) A validation method for fuzzy clustering of gene expression data, In Proc. of the 2011 International Conference on Bioinformatics & Computational Biology, Vol. I, pp. 23-29, Las Vegas, Nevada, USA.
[148] Bodjanova S. (2004) Linear intensification of probabilistic fuzzy partitions, Fuzzy Sets and Systems, Vol. 141, pp. 319-332.
[149] Chiu W.Y. and Couloigner I. (2006) Modified fuzzy c-means classification technique for mapping vague wetlands using Landsat ETM+ imagery, Hydrol. Process., Vol. 20, pp. 3623-3634.
[150] Chuang K.S., Tzeng H.L., Chen S., Wu J. and Chen T.J. (2006) Fuzzy c-means clustering with spatial information for image segmentation, Computerized Medical Imaging and Graphics, Vol. 30, pp. 9-15.
[151] Genther H., Runkler T.A. and Glesner M. (1994) Defuzzification based on fuzzy clustering, Fuzzy Systems, Vol. 3, pp. 1646-1648.
[152] Roychowdhury S. and Pedrycz W. (2001) A Survey of Defuzzification Strategies, Intelligent Systems, Vol. 16, pp. 679-695.
[153] Yang Y., Huang S. and Rao N. (2008) An Automatic Hybrid Method for Retinal Blood Vessel Extraction, Applied Mathematics and Computer Science, Vol. 18, pp. 399-407.
[154] Le T., Altman T. and Gardiner J.K. (2012) Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data, Proc. Intl' Conf. on Information Technology: New Generations, pp. 42-47, Las Vegas, NV, USA.
[155] Le T., Altman T. and Gardiner J.K. (2012) Density-based imputation method for fuzzy cluster analysis of gene expression microarray data, Proc. Intl' Conf. on Bioinformatics and Computational Biology, pp. 190-195, Las Vegas, NV, USA.
[156] Le T., Altman T. and Gardiner J.K. (2012) A Probability Based Defuzzification Method for Fuzzy Cluster Partition, Proc. Intl' Conf. on Artificial Intelligence (WORLDCOMP-ICAI'12), Vol. 2, pp. 1038-1043, Las Vegas, NV, USA.
[157] Le T., Altman T. and Gardiner J.K. (2012) A Fuzzy Clustering Method using Genetic Algorithm and Fuzzy Subtractive Clustering, Proc. Intl' Conf. on Information and Knowledge Engineering (WORLDCOMP-IKE'12), pp. 426-432, Las Vegas, NV, USA.
[158] Luo J.W., Yang T. and Wang Y. (2006) Missing Value Estimation for Microarray Data Based on Fuzzy C-means Clustering, Proc. Intl' Conf. on High-Performance Computing (HPC-ASIA'05), IEEE Press, Feb. 2006, pp. 11-16.
[159] Kim D.W., Lee K.Y., Lee K.H. and Lee D. (2007) Towards clustering of incomplete microarray data without the use of imputation, Bioinformatics, Vol. 23, Jan. 2007, pp. 107-113.
[160] Mohammadi A. and Saraee M.H. (2008) Estimating Missing Value in Microarray Data Using Fuzzy Clustering and Gene Ontology, Proc. Intl' Conf. on Bioinformatics and Biomedicine (BIBM'08), IEEE Press, Nov. 2008, pp. 382-385.
[161] Mehdizadeh E., Sadi-Nezhad S. and Tavakkoli-Moghaddam R. (2008) Optimization of Fuzzy Clustering Criteria by a Hybrid PSO and Fuzzy C-Means Clustering Algorithm, Iranian Journal of Fuzzy Systems, Vol. 5, pp. 1-14.
[162] Haider A., Pramanik S. and Kar A. (2011) Dynamic Image Segmentation using Fuzzy C-Means based Genetic Algorithm, International Journal of Computer Applications, Vol. 28, pp. 15-20.
[163] Ghosh A., Mishra N.S. and Ghosh S. (2011) Fuzzy clustering algorithms for unsupervised change detection in remote sensing images, Information Sciences, Vol. 181, pp. 699-715.
[164] Lianjiang Z., Shouning Q. and Tao D. (2010) Adaptive fuzzy clustering based on genetic algorithm, In Proc. of 2nd Conference on Advanced Computer Control, Shenyang, China, pp. 79-82.
[165] Lin T.C., Huang H.C., Liao B.Y. and Pan J.S. (2007) An Optimized Approach on Applying Genetic Algorithm to Adaptive Cluster Validity Index, International Journal of Computer Sciences and Engineering Systems, Vol. 1, pp. 253-257.
[166] Liu Y. and Zhang Y. (2007) Optimizing Parameters of Fuzzy c-Means Clustering Algorithm, In Proc. of the 4th Conference on Fuzzy Systems and Knowledge Discovery (FSKD '07), Vol. 1, 2007, pp. 633-638.
[167] Sturgeon X., Le T. and Gardiner J.K. (2010) Pathways to Intellectual Disability in Down Syndrome, Tenth Annual Coleman Institute Conference (The Coleman Institute for Cognitive Disabilities, University of Colorado), Westminster, Colorado, USA, Oct. 2010.
[168] Sturgeon X., Le T., Ahmed M. and Gardiner J.K. (2012) Pathways to cognitive deficits in Down syndrome, Progress in Brain Research, Elsevier, Vol. 197, pp. 73-100.
[169] Le T. and Altman T. (in preparation) RPPR: a rank product statistics with page rank algorithm combination method for expression profiles selection using gene expression signature.
[170] Le T., Leach S. and Gardiner J.K. (in preparation) A new approach of Connectivity MAP to discover gene-drug association using gene expression microarray data.
[171] Gardiner J.K. (2010) Molecular basis of pharmacotherapies for cognition in Down syndrome, Trends Pharmacol. Sci., Vol. 31, pp. 66-73.
[172] Dubois D. and Prade H. (1988) Possibility Theory: An Approach to Computerized Processing of Uncertainty, Plenum Press.
[173] Dubois D. and Prade H. (1980) Fuzzy Sets and Systems: Theory and Applications, Mathematics in Science and Engineering, Academic Press.
[174] Wolkenhauer O. (1998) Possibility Theory with Applications to Data Analysis, John Wiley & Sons Inc.
[175] Chu C.Y. and Rana T.M. (2006) Translation Repression in Human Cells by MicroRNA-Induced Gene Silencing Requires RCK/p54, PLoS Biology, Vol. 4, pp. 1122-1136.
[176] Velusamy K. and Manavalan R. (2011) Performance Analysis of Unsupervised Classification Based on Optimization, Int'l Journal of Computer Applications, Vol. 42, pp. 22-27.
[177] Rana S., Jasola S. and Kumar R. (2011) A review on particle swarm optimization algorithms and their applications to data clustering, Artif. Intell. Rev., Vol. 35, pp. 211-222.
[178] Xie J., Li K.C. and Bina M. (2004) A Bayesian Insertion/Deletion Algorithm for Distant Protein Motif Searching via Entropy Filtering, Journal of the American Statistical Association, Vol. 99, pp. 409-420.
[179] Klir G.J. and Yuan B. (1995) Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall.
[180] Le T., Altman T., Vu L. and Gardiner J.K. (in preparation) A cluster validation method using GO for gene expression data analysis.
[181] Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat. Genet., Vol. 25(1), pp. 25-29.
[182] Tari L., Baral C. and Kim S. (2009) Fuzzy c-means clustering with prior biological knowledge, J. Biomed. Inform., Vol. 42, pp. 74-81.
[183] Xu T., Du L.F. and Zhou Y. (2008) Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data, BMC Bioinformatics, Vol. 9, pp. 472-481.
[184] Wang J.Z., Du Z., Payattakool R., Yu P.S. and Chen C.F. (2007) A new method to measure the semantic similarity of GO terms, Bioinformatics, Vol. 23, pp. 1274-1281.
[185] Yu G., Li F., Qin Y. et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products, Bioinformatics, Vol. 26, pp. 976-983.
[186] Krucker T., Siggins G.R. and Halpain S. (2000) Dynamic actin filaments are required for stable long-term potentiation (LTP) in area CA1 of the hippocampus, PNAS, Vol. 97, pp. 6856-6861.